5 papers across 3 sessions
By carefully coordinating off-the-shelf models at inference time only, we show that the DSP framework achieves surprisingly strong theorem-proving results, comparable to frontier models trained with large-scale RL.
We introduce IneqMath, an informal inequality-proving benchmark with an LLM-as-judge evaluation suite, revealing that top LLMs achieve under 10% overall accuracy, largely due to flawed step-wise reasoning.
Learning proof-system dynamics, and pruning proof search based on candidate diversity and expected outcome.
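The pruning idea above can be sketched as a greedy selection over candidate tactics, keeping those that maximize a model-estimated success probability plus a diversity bonus. This is a minimal illustrative sketch, not the paper's method: the `Candidate` class, the lexical diversity measure, and the scoring weights are all hypothetical.

```python
# Hypothetical sketch: prune proof-search candidates by expected outcome
# plus a diversity bonus. All names and the scoring rule are illustrative.
from dataclasses import dataclass

@dataclass
class Candidate:
    tactic: str
    expected_success: float  # model-estimated probability this branch closes the goal

def diversity(tactic: str, kept: list[Candidate]) -> float:
    """Crude lexical diversity: fraction of tokens unseen among already-kept tactics."""
    tokens = set(tactic.split())
    seen = {t for c in kept for t in c.tactic.split()}
    return len(tokens - seen) / max(len(tokens), 1)

def prune(candidates: list[Candidate], beam: int = 3, alpha: float = 0.5) -> list[Candidate]:
    """Greedily keep `beam` candidates maximizing expected_success + alpha * diversity."""
    kept: list[Candidate] = []
    pool = list(candidates)
    while pool and len(kept) < beam:
        best = max(pool, key=lambda c: c.expected_success + alpha * diversity(c.tactic, kept))
        kept.append(best)
        pool.remove(best)
    return kept

cands = [
    Candidate("ring", 0.9),
    Candidate("ring_nf", 0.85),
    Candidate("nlinarith [sq_nonneg x]", 0.6),
    Candidate("simp", 0.5),
]
kept = prune(cands, beam=2)
```

The diversity term discourages the beam from filling up with near-duplicate tactics, so lower-probability but structurally different candidates can survive pruning.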
We introduce Ineq-Comp, a benchmark for testing compositional reasoning in formal inequality proving. Simple human-intuitive transformations cause major accuracy drops, showing that current LLM provers lack robust compositional generalization.