Research Scientist, USC/ISI
1 paper at NeurIPS 2025
A language model's internal states can often predict several behaviors of interest before they occur, and can be used to build precise early-warning systems that detect jailbreaking, instruction-following failures, and more.
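As a rough illustration of the idea, one common way to read a behavior off internal states is a linear probe: a small classifier trained on hidden-state vectors to predict whether a failure will follow. The sketch below is only an assumption about how such a detector might look, not the method from the paper. The names `train_probe` and `warn` are hypothetical, and the "hidden states" are synthetic Gaussian vectors standing in for real model activations.

```python
import numpy as np


def train_probe(states, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent.

    states: (n, d) array of hidden-state vectors (synthetic here).
    labels: (n,) array of 0/1 flags, 1 meaning the behavior of interest occurred.
    """
    n, d = states.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(states @ w + b)))
        grad = probs - labels  # gradient of the log loss w.r.t. the logits
        w -= lr * (states.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def warn(states, w, b, threshold=0.5):
    """Return a boolean flag per state: True means 'raise an early warning'."""
    probs = 1.0 / (1.0 + np.exp(-(states @ w + b)))
    return probs >= threshold


# Synthetic stand-in for activations: the two classes differ along one
# hidden direction, which is an assumption made purely for illustration.
rng = np.random.default_rng(0)
d = 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=400).astype(float)
states = rng.normal(size=(400, d)) + 2.0 * (2.0 * labels - 1.0)[:, None] * direction

# Train on the first 300 examples, evaluate on the held-out 100.
w, b = train_probe(states[:300], labels[:300])
flags = warn(states[300:], w, b)
accuracy = float((flags == labels[300:].astype(bool)).mean())
print(f"held-out probe accuracy: {accuracy:.2f}")
```

In a real setting the rows of `states` would be activations captured before the model produces its response, and the threshold would be tuned to trade false alarms against missed detections.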