logo
today local_bar
Poster Session 6 · Friday, December 5, 2025 4:30 PM → 7:30 PM
#1008

Parameterized Synthetic Text Generation with SimpleStories

NeurIPS Project Page Poster OpenReview

Abstract

We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million samples each in English and Japanese. Through parameterizing prompts at multiple levels of abstraction, we achieve control over story characteristics at scale, inducing syntactic and semantic diversity.
Ablations on a newly trained tiny model suite then show improved sample efficiency and model interpretability in comparison with the TinyStories dataset.
We open-source all constituent parts of model creation, hoping to enable novel ways to study the end-to-end training process. As a byproduct, we move the frontier with regards to the fewest-parameter language model that outputs grammatical English.
Poster