PhD student, Stanford University
2 papers at NeurIPS 2025
Applying risk management processes to improve, assess, and compare benchmark reliability yields a benchmark of benchmarks.
We introduce a scalable framework to flag invalid benchmark questions. We analyze statistical signals and use them to guide expert review, achieving up to 84% precision across nine popular benchmarks.