2 papers across 2 sessions
We introduce a principled framework for validating LLM-as-a-judge systems under rating indeterminacy, where multiple ratings can be "correct."
We argue that conclusions drawn about relative system safety or attack method efficacy via AI red teaming are often not supported by the evidence that attack success rate (ASR) comparisons provide.