We analyze alignment-faking propensities in 23 LLMs and attempt to explain why some LLMs fake alignment while others do not.