4 papers across 3 sessions
We analyze alignment-faking propensities in 23 LLMs and attempt to explain why some LLMs fake alignment while others do not.
Debate between AI experts outperforms single-advisor consultancy in helping humans make more accurate factual judgments, especially benefiting those with mainstream beliefs.