4 papers across 2 sessions
We evaluate frontier LM agents' ability to sabotage and sandbag ML engineering tasks without being detected by automated monitors.
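To make the evaluation setup concrete, here is a minimal sketch of the sabotage-vs-detection trade-off being measured; the `Episode` fields and the thresholding scheme are illustrative assumptions, not the paper's actual harness:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    sabotage_succeeded: bool   # did the agent complete a hidden sabotage objective? (hypothetical field)
    suspicion_score: float     # monitor's rating of how suspicious the transcript looks (hypothetical field)

def evaluate(episodes: list[Episode], threshold: float) -> tuple[float, float]:
    """Return (flag rate, undetected-sabotage rate) at a given suspicion threshold."""
    if not episodes:
        return 0.0, 0.0
    # Fraction of episodes the monitor flags as suspicious.
    flag_rate = sum(e.suspicion_score >= threshold for e in episodes) / len(episodes)
    # Fraction where sabotage succeeded and the monitor did NOT flag it.
    undetected = sum(
        e.sabotage_succeeded and e.suspicion_score < threshold for e in episodes
    ) / len(episodes)
    return flag_rate, undetected
```

Sweeping `threshold` traces out the trade-off: a stricter monitor flags more benign runs, while a laxer one lets more sabotage through undetected.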
We introduce a quantitative framework for modeling and optimizing scalable oversight, in which weaker AI systems supervise stronger ones, and show that oversight success diminishes as the capability gap widens across multiple oversight levels.
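One simple way to formalize the claim is below; the Elo-style win probability and the product over independent levels are our assumptions for illustration, not necessarily the paper's model:

```latex
% Model a single oversight step as a contest whose outcome depends on the
% capability gap Delta (overseen minus overseer), Elo-style (assumption):
\[
  p_{\text{win}}(\Delta) = \sigma(-\alpha \Delta) = \frac{1}{1 + e^{\alpha \Delta}},
\]
% so the overseer's success probability falls toward 0 as Delta grows.
% Composing n oversight levels with gaps Delta_1, ..., Delta_n and assuming
% independence, overall oversight success decays multiplicatively:
\[
  P_{\text{success}} = \prod_{i=1}^{n} p_{\text{win}}(\Delta_i).
\]
```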
We present Safe RLHF-V, a framework for multimodal safety alignment.
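For context, the original Safe RLHF frames alignment as constrained optimization with separate reward (helpfulness) and cost (safety) models; assuming Safe RLHF-V carries this formulation over to multimodal inputs, the objective looks roughly like the following sketch (not the paper's exact formulation):

```latex
% Maximize helpfulness reward subject to a bound on the safety cost,
% where x may include both image and text inputs (multimodal assumption):
\[
  \max_{\theta}\;
    \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ R(x, y) \big]
  \quad \text{s.t.} \quad
    \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ C(x, y) \big] \le 0,
\]
% typically handled via a Lagrangian of the form
\[
  \mathcal{L}(\theta, \lambda)
    = \mathbb{E}\big[ R(x, y) \big] - \lambda\, \mathbb{E}\big[ C(x, y) \big],
  \qquad \lambda \ge 0.
\]
```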
We propose BPO, a generalization of the DPO objective based on Bregman divergence, derived from the perspective of likelihood ratio estimation.
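For reference, the standard DPO loss is shown first; the Bregman-divergence generalization sketched after it is our reading of the summary, and the exact choice of convex function $f$ and parameterization are assumptions:

```latex
% Standard DPO loss over preference pairs (y_w preferred to y_l),
% with sigma the logistic sigmoid and pi_ref a frozen reference policy:
\[
  \mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
\]
% Bregman divergence for a strictly convex, differentiable f:
\[
  D_f(a, b) = f(a) - f(b) - f'(b)\,(a - b)
\]
% Sketch (assumption): treating the policy ratio r_theta = pi_theta / pi_ref
% as a likelihood-ratio estimator, a Bregman-divergence loss over r_theta
% induces a family of preference objectives, with DPO recovered for one
% particular choice of f.
```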