PhD student, Department of Computer Science and Technology, Tsinghua University
2 papers at NeurIPS 2025
In this paper, we interpret the mechanism behind safety alignment at the level of individual neurons and analyze the properties of these neurons.