We present a novel pair of benchmarks for evaluating the fundamental deductive reasoning abilities of test-time-compute reasoning models on program semantics.