Researcher, ML Alignment & Theory Scholars
2 papers at NeurIPS 2025
We analyze alignment-faking propensities in 23 LLMs and attempt to explain why some LLMs fake alignment while others don't.
We find that the reasoning traces of an RL-trained model often contain illegible segments, potentially compromising chain-of-thought monitoring as a method for detecting malicious behavior.