3 papers across 3 sessions
We evaluate frontier LM agents' ability to sabotage and sandbag ML engineering tasks without being detected by automated monitors.
We show that penalizing certain chain-of-thought (CoT) reasoning leads LLMs to learn encoding schemes that generalize to unseen examples.