Full Professor, Massachusetts Institute of Technology
4 papers at NeurIPS 2025
We investigate basic questions about how neural networks learn and represent skills, questions relevant to the problem of creating narrow AI systems.
A synthesis of layer-wise interventions, empirical probing experiments, and prior research suggests that inference in decoder-only LLMs unfolds in distinct phases.
We introduce a quantitative framework for modeling and optimizing scalable oversight, in which weaker AI systems supervise stronger ones. The framework shows that oversight success diminishes as capability gaps widen across multiple oversight levels.
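A minimal sketch of one way such a model can behave, assuming each oversight level succeeds with a logistic probability in the capability gap and that a nested chain succeeds only if every level does (the logistic form and the `slope` parameter are illustrative assumptions, not the paper's actual model):

```python
import math

def level_success(gap, slope=0.5):
    # Probability that an overseer controls a system `gap` capability
    # units stronger; logistic in the gap (illustrative assumption).
    return 1.0 / (1.0 + math.exp(slope * gap))

def nested_success(gaps, slope=0.5):
    # A nested oversight chain succeeds only if every level succeeds.
    p = 1.0
    for g in gaps:
        p *= level_success(g, slope)
    return p

# Wider gaps and deeper chains both erode overall oversight success:
print(nested_success([1.0]))      # one level, small capability gap
print(nested_success([3.0]))      # one level, larger gap
print(nested_success([1.0] * 3))  # three nested levels, small gaps
```

Under these assumptions, widening the gap at a single level and adding more levels both reduce the probability that the full oversight chain holds, matching the qualitative trend described above.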