4 papers across 2 sessions
We evaluate frontier LM agents' ability to sabotage and sandbag ML engineering tasks without being detected by automated monitors.
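To make the evaluation setup concrete, here is a minimal sketch of the sabotage-vs-detection trade-off being measured; the `Episode` fields and the thresholding scheme are illustrative assumptions, not the paper's actual harness:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    sabotage_succeeded: bool   # did the agent complete a hidden sabotage objective? (hypothetical field)
    suspicion_score: float     # monitor's rating of how suspicious the transcript looks (hypothetical field)

def evaluate(episodes: list[Episode], threshold: float) -> tuple[float, float]:
    """Return (flag rate, undetected-sabotage rate) at a given suspicion threshold."""
    if not episodes:
        return 0.0, 0.0
    # Fraction of episodes the monitor flags as suspicious.
    flag_rate = sum(e.suspicion_score >= threshold for e in episodes) / len(episodes)
    # Fraction where sabotage succeeded and the monitor did NOT flag it.
    undetected = sum(
        e.sabotage_succeeded and e.suspicion_score < threshold for e in episodes
    ) / len(episodes)
    return flag_rate, undetected
```

Sweeping `threshold` traces out the trade-off: a stricter monitor flags more benign runs, while a laxer one lets more sabotage through undetected.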
We introduce a quantitative framework for modeling and optimizing scalable oversight, in which weaker AI systems supervise stronger ones, and show that oversight success diminishes as the capability gap widens across multiple oversight levels.
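One simple way to formalize the claim is below; the Elo-style win probability and the product over independent levels are our assumptions for illustration, not necessarily the paper's model:

```latex
% Model a single oversight step as a contest whose outcome depends on the
% capability gap Delta (overseen minus overseer), Elo-style (assumption):
\[
  p_{\text{win}}(\Delta) = \sigma(-\alpha \Delta) = \frac{1}{1 + e^{\alpha \Delta}},
\]
% so the overseer's success probability falls toward 0 as Delta grows.
% Composing n oversight levels with gaps Delta_1, ..., Delta_n and assuming
% independence, overall oversight success decays multiplicatively:
\[
  P_{\text{success}} = \prod_{i=1}^{n} p_{\text{win}}(\Delta_i).
\]
```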
We present Safe RLHF-V, a framework for multimodal safety alignment.
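For context, the original Safe RLHF frames alignment as constrained optimization with separate reward (helpfulness) and cost (safety) models; assuming Safe RLHF-V carries this formulation over to multimodal inputs, the objective looks roughly like the following sketch (not the paper's exact formulation):

```latex
% Maximize helpfulness reward subject to a bound on the safety cost,
% where x may include both image and text inputs (multimodal assumption):
\[
  \max_{\theta}\;
    \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ R(x, y) \big]
  \quad \text{s.t.} \quad
    \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ C(x, y) \big] \le 0,
\]
% typically handled via a Lagrangian of the form
\[
  \mathcal{L}(\theta, \lambda)
    = \mathbb{E}\big[ R(x, y) \big] - \lambda\, \mathbb{E}\big[ C(x, y) \big],
  \qquad \lambda \ge 0.
\]
```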
We propose BPO, a generalization of the DPO objective based on Bregman divergence, derived from the perspective of likelihood ratio estimation.
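For reference, the standard DPO loss is shown first; the Bregman-divergence generalization sketched after it is our reading of the summary, and the exact choice of convex function $f$ and parameterization are assumptions:

```latex
% Standard DPO loss over preference pairs (y_w preferred to y_l),
% with sigma the logistic sigmoid and pi_ref a frozen reference policy:
\[
  \mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
\]
% Bregman divergence for a strictly convex, differentiable f:
\[
  D_f(a, b) = f(a) - f(b) - f'(b)\,(a - b)
\]
% Sketch (assumption): treating the policy ratio r_theta = pi_theta / pi_ref
% as a likelihood-ratio estimator, a Bregman-divergence loss over r_theta
% induces a family of preference objectives, with DPO recovered for one
% particular choice of f.
```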