Researcher, Anthropic
1 paper at NeurIPS 2025
We introduce a quantitative framework for modeling and optimizing scalable oversight, in which weaker AI systems supervise stronger ones, and show that oversight success diminishes as the capability gap widens across multiple oversight levels.