6 papers across 3 sessions
Chatbot Arena has become a leading platform for ranking AI models. Our extensive study uncovers hidden dynamics that distort its rankings and provides concrete steps to make model evaluation on the platform fairer and more transparent.
We propose correlation dimension as a practical, model-agnostic metric that captures structural complexity and detects degeneration in large language model outputs beyond what perplexity reveals.
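The blurb above does not spell out how correlation dimension is estimated for model outputs; a standard way to compute it for any point cloud is the classical Grassberger-Procaccia correlation sum, whose log-log slope gives the dimension. The sketch below is a minimal, illustrative version of that generic estimator applied to an arbitrary embedding matrix; the function name, radius sampling, and toy example are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def correlation_dimension(points, n_radii=20):
    """Estimate correlation dimension via the Grassberger-Procaccia correlation sum.

    points: (N, d) array, e.g. embeddings or hidden states of generated tokens
    (an assumed input; the paper may use a different representation).
    Returns the slope of log C(r) versus log r over the sampled radii.
    """
    # Pairwise Euclidean distances, upper triangle only (each pair counted once).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    pair_dists = dists[np.triu_indices(len(points), k=1)]

    # Sample radii log-uniformly inside the bulk of the observed distance range.
    radii = np.logspace(np.log10(np.percentile(pair_dists, 5)),
                        np.log10(np.percentile(pair_dists, 95)), n_radii)

    # Correlation sum C(r): fraction of point pairs closer than r.
    corr_sum = np.array([(pair_dists < r).mean() for r in radii])

    # Correlation dimension: slope of log C(r) against log r.
    slope, _ = np.polyfit(np.log(radii), np.log(corr_sum), 1)
    return slope

# Toy check: points on a noisy 1-D curve embedded in 3-D give a slope near 1.
t = np.linspace(0, 4 * np.pi, 500)
curve = np.stack([np.cos(t), np.sin(t), 0.1 * t], axis=1)
curve += 0.01 * np.random.default_rng(0).normal(size=curve.shape)
print(round(correlation_dimension(curve), 2))
```

The slope is lower when points collapse onto a low-dimensional structure, which is the intuition behind using such a measure to flag degenerate (e.g., repetitive) generations that perplexity alone may miss.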
We propose RBD, a plug-in module that detects and corrects biased LLM evaluations through structured reasoning, significantly improving accuracy, consistency, and scalability across multiple bias types and evaluator models.