3 papers across 2 sessions
We incorporate lie detectors into the labelling step of preference learning and characterize the factors that determine whether the trained policy becomes honest or instead learns to evade the detector.
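A minimal sketch of the general idea, not the paper's implementation: a per-response lie-detector score (the `lie_detector_score` field, threshold, and penalty below are all hypothetical stand-ins) adjusts which response wins each preference comparison, so labels steer the reward signal toward honest completions.

```python
from dataclasses import dataclass


@dataclass
class Response:
    text: str
    base_preference: float      # raw human/model preference score
    lie_detector_score: float   # assumed P(deceptive), e.g. from an activation probe


def label_pair(a: Response, b: Response,
               detector_threshold: float = 0.5,
               penalty: float = 1.0) -> int:
    """Return 0 if `a` is preferred, 1 if `b` is preferred.

    A response flagged by the lie detector has its preference score
    penalized, so honest responses tend to win the comparison.
    """
    def adjusted(r: Response) -> float:
        flagged = r.lie_detector_score > detector_threshold
        return r.base_preference - (penalty if flagged else 0.0)

    return 0 if adjusted(a) >= adjusted(b) else 1


# Example: a fluent but deceptive response loses to a plainer honest one.
honest = Response("The package has not shipped yet.",
                  base_preference=0.6, lie_detector_score=0.1)
deceptive = Response("Your package shipped this morning!",
                     base_preference=0.8, lie_detector_score=0.9)
assert label_pair(honest, deceptive) == 0
```

Whether the policy becomes honest or evasive then depends on factors such as how reliable the detector is and how strongly flagged responses are penalized.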
We introduce a quantitative framework to model and optimize scalable oversight (weaker AI systems supervising stronger ones), showing that oversight success diminishes as the capability gap widens and as oversight is chained across multiple levels.
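A toy illustration under assumed functional forms (Elo-style ratings with a logistic success curve; not the paper's exact model): per-level oversight success falls off with the capability gap, and a multi-level chain succeeds only if every level succeeds.

```python
def oversight_success(overseer_elo: float, target_elo: float,
                      scale: float = 400.0) -> float:
    """Probability the overseer successfully supervises the stronger target,
    modelled as an Elo-style logistic function of the capability gap."""
    return 1.0 / (1.0 + 10 ** ((target_elo - overseer_elo) / scale))


def chained_success(elos: list[float]) -> float:
    """End-to-end success when each system oversees the next stronger one."""
    p = 1.0
    for weaker, stronger in zip(elos, elos[1:]):
        p *= oversight_success(weaker, stronger)
    return p


# Wider per-level capability gaps shrink the end-to-end success probability.
print(chained_success([1000, 1200, 1400]))  # moderate gaps
print(chained_success([1000, 1500, 2000]))  # wider gaps -> much lower
```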
Debate between AI experts outperforms single-advisor consultancy in helping humans make more accurate factual judgments, especially benefiting those with mainstream beliefs.