Associate Professor, Stanford University
1 paper at NeurIPS 2025
We introduce a scalable framework for flagging invalid benchmark questions. The framework analyzes statistical signals and uses them to guide expert review, achieving up to 84% precision in identifying invalid questions across nine popular benchmarks.