3 papers across 2 sessions
We incorporate lie detectors into the labelling step of preference learning and characterize the factors that determine whether the trained policy becomes honest or learns to evade the detector.
We introduce DeceptionBench, the first comprehensive benchmark evaluating deceptive behaviors of LLMs across real-world scenarios, revealing critical vulnerabilities, especially under reinforcement dynamics.
We create a sandbox for LLM agents that elicits goal-directed, open-ended strategic deception, evaluate this deceptive capability, and show that linear probes detect it reliably, even out of distribution (OOD).