Jonathan May

Research Scientist, USC/ISI

1 paper at NeurIPS 2025

Homepage · OpenReview · Semantic Scholar · Google Scholar

Poster Session 5

1 paper

Friday, December 5, 2025 · 11:00 AM → 2:00 PM

Exhibit Hall C, D, E

Language Models Can Predict Their Own Behavior

#3712 · Dhananjay Ashok, Jonathan May

The internal states of a language model can often predict several behaviors of interest before they occur, and can be used to build precise early-warning systems that detect jailbreaking, instruction-following failures, and more.