Poster Session 6 · Friday, December 5, 2025 4:30 PM → 7:30 PM
#1410
Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs
Abstract
Despite substantial efforts in safety alignment, recent research indicates that Large Language Models (LLMs) remain highly susceptible to jailbreak attacks. Among these attacks, fine-tuning-based ones, which compromise LLMs' safety alignment via fine-tuning, stand out for their stable jailbreak performance. In particular, a recent study indicates that fine-tuning with as few as 10 harmful question-answer (QA) pairs can lead to successful jailbreaking across various harmful questions. However, such malicious fine-tuning attacks are readily detectable, and hence thwarted, by moderation models.
In this paper, we demonstrate that LLMs can be jailbroken by fine-tuning with only 10 benign QA pairs; our attack exploits the increased sensitivity of LLMs to fine-tuning data after they have been overfitted. Specifically, our fine-tuning process starts by overfitting an LLM on benign QA pairs that all share an identical refusal answer. Further fine-tuning is then performed with standard benign answers, causing the overfitted LLM to forget the refusal behavior and thus provide compliant answers regardless of the harmfulness of a question.
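The two-stage data construction described above can be sketched as follows. This is a minimal illustration under assumptions: the questions, answers, refusal string, and record format are hypothetical placeholders, not the paper's actual data or code, and the actual fine-tuning step is omitted.

```python
# Hypothetical sketch of the two-stage fine-tuning data implied by the abstract.
# All strings below are placeholders; the paper's real 10-shot data may differ.

REFUSAL = "I'm sorry, but I can't help with that."  # identical refusal used in stage 1

benign_questions = [
    "What is the capital of France?",
    "How do plants perform photosynthesis?",
    # ... 8 more benign questions in the 10-shot setting
]

benign_answers = [
    "The capital of France is Paris.",
    "Plants use sunlight, water, and CO2 to synthesize sugars.",
    # ... matching standard benign answers
]

def stage1_overfit_data(questions):
    """Stage 1: pair every benign question with the SAME refusal answer,
    so fine-tuning overfits the model toward refusing."""
    return [{"question": q, "answer": REFUSAL} for q in questions]

def stage2_compliance_data(questions, answers):
    """Stage 2: pair the same benign questions with their standard answers,
    pushing the overfitted model to abandon the refusal behavior."""
    return [{"question": q, "answer": a} for q, a in zip(questions, answers)]

stage1 = stage1_overfit_data(benign_questions)
stage2 = stage2_compliance_data(benign_questions, benign_answers)
```

Both stages use only benign questions, which is what lets the attack pass content moderation while still erasing the model's refusal behavior.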
We implement our attack on ten LLMs and compare it with five existing baselines. Experiments demonstrate that our method achieves significant advantages in both attack effectiveness and attack stealth. Our findings expose previously unreported security vulnerabilities in current LLMs and provide a new perspective on understanding how LLMs' security can be compromised, even by benign fine-tuning.
Our code is available at https://github.com/ZHIXINXIE/tenbenign.git.