Full Professor, University of Washington
4 papers at NeurIPS 2025
Measuring and improving the signal-to-noise ratio in language model benchmarks.
Chatbot Arena has become a leading platform for ranking AI models. Our extensive study uncovers hidden dynamics that distort its rankings and provides concrete steps to improve the fairness and transparency of model evaluation on Chatbot Arena.
Language models are surprisingly robust to non-canonical tokenizations of the input, which can even lead to improved performance.