2 papers across 2 sessions
We evaluate frontier LM agents' ability to sabotage and sandbag ML engineering tasks without being detected by automated monitors.
We introduce a novel method that uses noise injection to elicit the latent capabilities of sandbagging LLMs.