1 paper across 1 session
We introduce a principled framework for validating LLM-as-a-judge systems under rating indeterminacy, where multiple ratings can be "correct."