2 papers across 2 sessions
Our new benchmark AbstentionBench reveals reasoning models struggle to determine when not to answer.
We decompose the gap between selective classifiers and the ideal oracle into five measurable sources, showing that only non-monotone scoring methods can reduce it and improve reliability.