3 papers across 2 sessions
We incorporate lie detectors into the labelling step of preference learning and characterize the factors that determine whether the trained policy becomes honest or instead learns to evade the detector.
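A minimal sketch of the general idea, not the paper's implementation: a per-response lie-detector score (the `lie_detector_score` field, threshold, and penalty below are all hypothetical stand-ins) adjusts which response wins each preference comparison, so labels steer the reward signal toward honest completions.

```python
from dataclasses import dataclass


@dataclass
class Response:
    text: str
    base_preference: float      # raw human/model preference score
    lie_detector_score: float   # assumed P(deceptive), e.g. from an activation probe


def label_pair(a: Response, b: Response,
               detector_threshold: float = 0.5,
               penalty: float = 1.0) -> int:
    """Return 0 if `a` is preferred, 1 if `b` is preferred.

    A response flagged by the lie detector has its preference score
    penalized, so honest responses tend to win the comparison.
    """
    def adjusted(r: Response) -> float:
        flagged = r.lie_detector_score > detector_threshold
        return r.base_preference - (penalty if flagged else 0.0)

    return 0 if adjusted(a) >= adjusted(b) else 1


# Example: a fluent but deceptive response loses to a plainer honest one.
honest = Response("The package has not shipped yet.",
                  base_preference=0.6, lie_detector_score=0.1)
deceptive = Response("Your package shipped this morning!",
                     base_preference=0.8, lie_detector_score=0.9)
assert label_pair(honest, deceptive) == 0
```

Whether the policy becomes honest or evasive then depends on factors such as how reliable the detector is and how strongly flagged responses are penalized.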
We introduce a quantitative framework to model and optimize scalable oversight (weaker AI systems supervising stronger ones), showing that oversight success diminishes as the capability gap widens and as oversight is chained across multiple levels.
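A toy illustration under assumed functional forms (Elo-style ratings with a logistic success curve; not the paper's exact model): per-level oversight success falls off with the capability gap, and a multi-level chain succeeds only if every level succeeds.

```python
def oversight_success(overseer_elo: float, target_elo: float,
                      scale: float = 400.0) -> float:
    """Probability the overseer successfully supervises the stronger target,
    modelled as an Elo-style logistic function of the capability gap."""
    return 1.0 / (1.0 + 10 ** ((target_elo - overseer_elo) / scale))


def chained_success(elos: list[float]) -> float:
    """End-to-end success when each system oversees the next stronger one."""
    p = 1.0
    for weaker, stronger in zip(elos, elos[1:]):
        p *= oversight_success(weaker, stronger)
    return p


# Wider per-level capability gaps shrink the end-to-end success probability.
print(chained_success([1000, 1200, 1400]))  # moderate gaps
print(chained_success([1000, 1500, 2000]))  # wider gaps -> much lower
```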
Debate between AI experts outperforms single-advisor consultancy in helping humans make more accurate factual judgments, especially benefiting those with mainstream beliefs.