We introduce a scalable framework to flag invalid benchmark questions. We analyze statistical signals and use them to guide expert review, achieving up to 84% precision across nine popular benchmarks.
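To make the idea of a statistical signal concrete, here is a minimal sketch of one possible signal. The abstract does not say which signals the framework uses, so the choice of item-rest correlation (how well correctness on one question tracks a model's score on the remaining questions) is an assumption made for illustration. The function names, the threshold, and the toy data are likewise hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def flag_suspect_questions(responses, threshold=0.0):
    """Return indices of questions whose item-rest correlation is below threshold.

    responses[m][q] is 1 if model m answered question q correctly, else 0.
    Assumption: a question that weaker models get right and stronger models
    get wrong is a candidate for expert review, not a confirmed invalid item.
    """
    n_questions = len(responses[0])
    flagged = []
    for q in range(n_questions):
        item = [row[q] for row in responses]
        # Score on all other questions, so the item does not correlate with itself.
        rest = [sum(row) - row[q] for row in responses]
        if pearson(item, rest) < threshold:
            flagged.append(q)
    return flagged


# Toy data (hypothetical): 4 models (rows) x 4 questions (columns).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
suspects = flag_suspect_questions(responses)
print(suspects)  # [3]
```

In this toy matrix only the last question is flagged, because the models that do worst on the other questions are the ones that answer it correctly. A signal like this only ranks candidates; under the framework described above, the flagged questions would still go to expert review.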