3 papers across 2 sessions
We incorporate lie detectors into the labelling step of preference learning and characterize the factors that determine whether the trained policy becomes honest or learns to evade the detector.
We introduce DeceptionBench, the first comprehensive benchmark evaluating deceptive behaviors of LLMs across real-world scenarios, revealing critical vulnerabilities, especially under reinforcement dynamics.
We create a sandbox for LLM agents that elicits goal-directed, open-ended strategic deception, evaluate this deceptive capability, and show that linear probes detect it reliably, even out of distribution (OOD).