Full Professor, Beihang University
2 papers at NeurIPS 2025
We introduce DeceptionBench, the first comprehensive benchmark for evaluating deceptive behaviors in LLMs across real-world scenarios; it reveals critical vulnerabilities, especially under reinforcement dynamics.
We propose manifold steering, which projects the steering direction for model overthinking onto the low-dimensional activation manifold, reducing output tokens while maintaining accuracy.
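
To make the projection idea concrete, here is a minimal sketch on synthetic data, not the paper's implementation. It assumes the activation manifold can be approximated by a top-k PCA subspace and that the raw steering direction is a difference of mean activations between "overthinking" and "concise" samples. All function names, dimensions, and the ablation-style steering rule are hypothetical choices made for illustration.

```python
import numpy as np


def pca_basis(acts: np.ndarray, k: int) -> np.ndarray:
    """Return a (d, k) orthonormal basis for the top-k PCA subspace of acts."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def project_direction(direction: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project a direction onto the subspace spanned by basis and normalize it."""
    projected = basis @ (basis.T @ direction)
    norm = np.linalg.norm(projected)
    if norm == 0.0:
        raise ValueError("direction is orthogonal to the subspace")
    return projected / norm


def steer(hidden: np.ndarray, unit_dir: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Subtract alpha times each row's component along unit_dir (assumed rule)."""
    coeff = hidden @ unit_dir
    return hidden - alpha * np.outer(coeff, unit_dir)


# Synthetic stand-ins for hidden states (assumption: no real model is used).
rng = np.random.default_rng(0)
d, k, n = 64, 4, 200
latent = np.linalg.qr(rng.normal(size=(d, k)))[0]  # hidden k-dim subspace
noise = lambda: 0.01 * rng.normal(size=(n, d))
overthinking_acts = (rng.normal(size=(n, k)) + 2.0) @ latent.T + noise()
concise_acts = rng.normal(size=(n, k)) @ latent.T + noise()

# Raw direction: difference of group means (includes off-manifold noise).
raw_direction = overthinking_acts.mean(axis=0) - concise_acts.mean(axis=0)

# Estimate the subspace from all activations, then project the direction onto it.
basis = pca_basis(np.vstack([overthinking_acts, concise_acts]), k)
unit_dir = project_direction(raw_direction, basis)

# Apply the projected direction to the "overthinking" activations.
steered = steer(overthinking_acts, unit_dir, alpha=1.0)
```

Projecting before steering discards the part of the raw direction that lies outside the estimated subspace, so the intervention moves activations only along directions the data actually occupies. The choice of k and the steering strength alpha are free parameters in this sketch.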