In this position paper, we investigate the validity and reliability of LLMs as judges, and we highlight challenges inherent to their use and to existing practices in NLG evaluation.
We argue that conclusions about relative system safety or attack-method efficacy drawn via AI red teaming are often not supported by the evidence that attack success rate (ASR) comparisons provide.