Associate Professor, Princeton University
2 papers at NeurIPS 2025
SAGE‑Eval is the first benchmark to test whether frontier LLMs robustly generalize critical safety knowledge to novel situations, and we show that the strongest model we tested passed only 58% of the safety facts evaluated.
Different ways of prompting the same task elicit different task representations in language models.