4 papers across 3 sessions
We create a sandbox for LLM-agents to elicit goal-directed open-ended strategic deception, evaluate this deceptive capability, and show that linear probes do very well at detecting it, even OOD.
We show that penalizing certain CoT reasoning makes LLMs learn encoding schemes that generalize to unseen examples.
VLMs often perform worse at recalling facts than their LLM backbones because visual representations are formed too late in the forward pass to trigger the LLMs factual recall circuit.