Researcher, ML Alignment & Theory Scholars
2 papers at NeurIPS 2025
We analyze alignment-faking propensities in 23 LLMs and attempt to explain why some LLMs fake alignment while others don't.
We find that the reasoning traces of an RL-trained model often contain illegible segments, potentially compromising chain-of-thought monitoring as a method for detecting malicious behavior.