Researcher, Anthropic
3 papers at NeurIPS 2025
We analyze alignment-faking propensities in 23 LLMs and examine why some models fake alignment while others do not.
We introduce a novel method that uses noise injection to elicit the latent capabilities of sandbagging LLMs.