2 papers across 2 sessions
We analyze alignment-faking propensities in 23 LLMs and attempt to explain why some LLMs fake alignment and others don't.
We study how to optimally construct monitoring protocols from multiple monitors of varying cost and performance.