Assistant Professor, Tsinghua University
2 papers at NeurIPS 2025
In this paper, we interpret the mechanism behind safety alignment at the neuron level and analyze the properties of the neurons involved.
We propose a benchmark to evaluate the instruction-following ability of large language models in agentic scenarios.