Assistant Professor, Tsinghua University
3 papers at NeurIPS 2025
We introduce DeceptionBench, the first comprehensive benchmark evaluating deceptive behaviors in LLMs across real-world scenarios, revealing critical vulnerabilities especially under reinforcement dynamics.
We introduce a noval framework for red-teaming black-box T2I systems, termed Rule-based Preference modeling Guided Red-Teaming (RPG-RT).
We propose manifold steering that projects the steering direction of model overthinking on the low-dimensional activation manifold, effectively reducing output tokens while maintaining accuracy.