Undergraduate student, University of California, Los Angeles
1 paper at NeurIPS 2025
We introduce IneqMath, a benchmark for informal inequality proving, and an LLM-as-judge suite, revealing that top LLMs achieve <10% overall accuracy because of flawed step-wise reasoning.