1 paper across 1 session
Our new benchmark, AbstentionBench, reveals that reasoning models struggle to determine when not to answer.