3 papers across 3 sessions
We present G-Vendi, a data diversity measure that strongly correlates with LLM reasoning generalization on OOD benchmarks; using this insight, we diversify synthetic reasoning data, which leads to SOTA distilled models for NLI and math reasoning.
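G-Vendi builds on the Vendi score, which turns a pairwise-similarity kernel over a dataset into an "effective number of distinct items" (the exponential of the entropy of the kernel's eigenvalues). A minimal sketch of that base score, assuming cosine similarity over generic example embeddings; the gradient-based kernel that gives G-Vendi its "G" is specific to the paper and not reproduced here:

```python
import numpy as np

def vendi_score(embeddings: np.ndarray) -> float:
    """Effective number of distinct items in a set of embeddings.

    embeddings: (n, d) array; rows are L2-normalized below so the
    kernel is cosine similarity. The kernel is scaled to trace 1,
    so its eigenvalues form a probability distribution.
    """
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    K = X @ X.T / len(X)            # normalized similarity kernel, trace = 1
    lam = np.linalg.eigvalsh(K)
    lam = lam[lam > 1e-12]          # drop numerical zeros; 0 * log 0 := 0
    return float(np.exp(-np.sum(lam * np.log(lam))))

# n identical points collapse to ~1; n orthogonal points spread to ~n
print(vendi_score(np.ones((4, 8))))
print(vendi_score(np.eye(4)))
```

The score interpolates smoothly between those extremes, which is what makes it usable as a selection criterion when curating a diverse subset.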
We compare leading open SFT datasets, add quality annotations using MagPie, and design curation recipes that yield a leaner, higher-performing SFT mixture.
Nemotron-CLIMB automates data mixture optimization for pre-training, improving domain adaptation and outperforming Llama-3.2-1B by 2.0% on general reasoning.
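Automated mixture optimization of this kind is typically predictor-guided: sample candidate mixture weights over data clusters, score a few with an expensive proxy run, fit a cheap regression from weights to score, then let the predictor screen a much larger candidate pool. A toy sketch of that loop, with a hypothetical quadratic `proxy_score` standing in for actually training and evaluating a proxy model (the clustering, proxy training, and iteration schedule in CLIMB itself are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4  # number of data clusters (toy)

def proxy_score(w: np.ndarray) -> float:
    """Stand-in for training a small proxy model on mixture w and
    measuring downstream accuracy (hypothetical landscape peaked at
    an assumed 'ideal' mixture)."""
    target = np.array([0.4, 0.3, 0.2, 0.1])
    return -float(np.sum((w - target) ** 2))

# Round 1: sample candidate mixtures on the simplex, score each
# with the expensive proxy.
W = rng.dirichlet(np.ones(K), size=64)
y = np.array([proxy_score(w) for w in W])

# Fit a cheap predictor (least squares over [w, w^2] features).
feats = np.hstack([W, W ** 2])
coef, *_ = np.linalg.lstsq(feats, y, rcond=None)

# Round 2: screen a much larger pool with the predictor, then spend
# the expensive proxy only on the predicted-best mixture.
cands = rng.dirichlet(np.ones(K), size=4096)
pred = np.hstack([cands, cands ** 2]) @ coef
best = cands[np.argmax(pred)]
print(best.round(2), proxy_score(best))
```

The design point is cost asymmetry: the predictor evaluates thousands of mixtures for the price of one proxy training run, so each iteration concentrates real compute on the most promising region of the simplex.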