2 papers across 2 sessions
We evaluate frontier LM agents' ability to sabotage and sandbag ML engineering tasks without being detected by automated monitors.
We introduce a novel method that uses noise injection to elicit the latent capabilities of sandbagging LLMs.