MS student, New York University
2 papers at NeurIPS 2025
We build an LM-based system that outperforms expert AI researchers at predicting the outcomes of empirical AI research ideas, without running any actual experiments.
SAGE‑Eval is the first benchmark to test whether frontier LLMs robustly generalize critical safety knowledge to novel situations; the strongest model we tested passed only 58% of the safety facts evaluated.