Data Diversity

1 paper across 1 session

Poster Session 1

Wednesday, December 3, 2025 · 11:00 AM → 2:00 PM

Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning

#116 Spotlight · Jaehun Jung, Seungju Han, Ximing Lu, Skyler Hallinan, David Acuna, Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi

We present G-Vendi, a data diversity measure that strongly correlates with LLM reasoning generalization in OOD benchmarks; we use this insight to diverse synthetic reasoning data, which leads to SOTA distilled models in NLI and math reasoning.