Undergraduate student, Peking University
2 papers at NeurIPS 2025
Masked Diffusion Models can generate low-perplexity text efficiently, but cannot efficiently handle tasks that demand high accuracy.
We introduce Ineq-Comp, a benchmark for testing compositional reasoning in formal inequality proving. Simple, human-intuitive transformations of the problems cause major accuracy drops, showing that current LLM provers lack robust compositional generalization.