Poster Session 2 · Wednesday, December 3, 2025 4:30 PM → 7:30 PM
#1112
LLMs Encode Harmfulness and Refusal Separately
Abstract
LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond just refusing? Prior work has shown that LLMs' refusal behaviors can be mediated by a one-dimensional subspace, i.e., a refusal direction. In this work, we identify a new dimension for analyzing safety mechanisms in LLMs, i.e., harmfulness, which is encoded internally as a concept separate from refusal, and we show that there exists a harmfulness direction distinct from the refusal direction.
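For readers unfamiliar with direction-based analyses, a common recipe in this line of work (not necessarily the exact procedure used here) estimates a concept direction as the difference between mean hidden states over contrasting prompt sets. The sketch below assumes this difference-of-means approach; `harmful_acts` and `harmless_acts` are hypothetical caches of activations taken at a chosen layer and token position.

```python
import torch

def concept_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between two activation sets,
    each of shape (num_prompts, hidden_dim)."""
    d = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    # Unit-normalize so steering strength is controlled by a single scalar.
    return d / d.norm()
```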
As causal evidence, steering along the harmfulness direction can lead LLMs to interpret harmless instructions as harmful, whereas steering along the refusal direction tends to elicit refusal responses directly without reversing the model's judgment of harmfulness. Furthermore, using our identified harmfulness concept, we find that certain jailbreak methods work by reducing the refusal signals without suppressing the model's internal belief of harmfulness. We also find that adversarially fine-tuning models to accept harmful instructions has minimal impact on this internal belief.
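Steering along a direction is typically implemented by adding a scaled copy of the direction vector to a layer's hidden states during the forward pass. The following is a minimal sketch under that assumption for a HuggingFace-style decoder; the layer index and strength `alpha` are illustrative choices, not values from the paper.

```python
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Forward hook that shifts a decoder layer's hidden states along
    `direction` with strength `alpha` (sign controls push vs. suppress)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Usage (hypothetical layer index and strength):
# handle = model.model.layers[15].register_forward_hook(make_steering_hook(d, 8.0))
# outputs = model.generate(**inputs)
# handle.remove()
```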
These insights lead to a practical safety application: the model's latent harmfulness representation can serve as an intrinsic safeguard (Latent Guard) that detects unsafe inputs, reduces over-refusals, and is robust to fine-tuning attacks. For instance, our Latent Guard achieves performance comparable to or better than Llama Guard 3 8B, a dedicated fine-tuned safeguard model, across different jailbreak methods.
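One simple way to realize such a detector (a sketch, not necessarily the paper's exact method) is to project a hidden state onto the harmfulness direction and threshold the score; `hidden`, `direction`, and `threshold` below are all assumptions for illustration.

```python
import torch

def harmfulness_score(hidden: torch.Tensor, direction: torch.Tensor) -> float:
    """Projection of a single hidden state (hidden_dim,) onto the
    unit-normalized harmfulness direction."""
    return torch.dot(hidden, direction).item()

def latent_guard(hidden: torch.Tensor, direction: torch.Tensor,
                 threshold: float) -> bool:
    # Flag the input as unsafe when its projection exceeds a threshold
    # calibrated on held-out labeled prompts.
    return harmfulness_score(hidden, direction) > threshold
```

Because this reads the model's own representation rather than its refusal behavior, the score can remain informative even when fine-tuning has suppressed the refusal response itself.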
Our findings suggest that LLMs' internal understanding of harmfulness is more robust to diverse input instructions than their refusal decisions, offering a new perspective on the study of AI safety.