We analyze alignment-faking propensities in 23 LLMs and attempt to explain why some LLMs fake alignment while others do not.