Undergraduate student, University of California, Davis
1 paper at NeurIPS 2025
We present a novel pair of benchmarks for evaluating how well reasoning models that use test-time compute perform fundamental deductive reasoning about program semantics.