3 papers across 2 sessions
We mix discrete and continuous adversarial attacks to adversarially train more robust LLMs. Evaluated across a range of realistic inference settings, our models are more robust than other SoTA models while matching their training cost.
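To make the idea concrete, here is a minimal sketch of what mixing the two attack families in one training loop could look like. It is not the paper's implementation: it assumes a HuggingFace-style causal LM, uses a PGD-style embedding perturbation as the continuous attack, and substitutes a random suffix for a real discrete attack (which would be optimized, e.g., GCG-style).

```python
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def continuous_attack(model, embeds, labels, eps=0.05, steps=5, lr=0.01):
    """PGD-style perturbation of the input embeddings (the continuous attack)."""
    embeds = embeds.detach()  # perturb a fixed embedding; don't reuse its graph
    delta = torch.zeros_like(embeds, requires_grad=True)
    for _ in range(steps):
        loss = model(inputs_embeds=embeds + delta, labels=labels).loss
        loss.backward()
        with torch.no_grad():
            delta += lr * delta.grad.sign()  # ascend the loss
            delta.clamp_(-eps, eps)          # project back into the eps-ball
            delta.grad.zero_()
    return delta.detach()


def discrete_attack(prompt_ids, target_ids, vocab_size, suffix_len=10):
    """Stand-in discrete attack: insert a *random* token suffix between prompt
    and target; a real attack would optimize the suffix (e.g., GCG-style)."""
    suffix = torch.randint(vocab_size, (prompt_ids.size(0), suffix_len),
                           device=prompt_ids.device)
    input_ids = torch.cat([prompt_ids, suffix, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.size(1) + suffix_len] = -100  # supervise target only
    return input_ids, labels


def train_step(model, optimizer, prompt_ids, target_ids,
               p_continuous=0.5, suffix_len=10):
    """One adversarial-training step: attack, then fit the safe completion."""
    if random.random() < p_continuous:
        ids = torch.cat([prompt_ids, target_ids], dim=1)
        labels = ids.clone()
        labels[:, : prompt_ids.size(1)] = -100
        embeds = model.get_input_embeddings()(ids)
        delta = continuous_attack(model, embeds, labels)
        loss = model(inputs_embeds=embeds + delta, labels=labels).loss
    else:
        ids, labels = discrete_attack(prompt_ids, target_ids,
                                      model.config.vocab_size, suffix_len)
        loss = model(input_ids=ids, labels=labels).loss
    optimizer.zero_grad()  # also clears gradients accumulated by the attack
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage: reinforce a refusal under attack.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
prompt = tok("How do I pick a lock?", return_tensors="pt").input_ids
target = tok(" I can't help with that.", return_tensors="pt").input_ids
print(train_step(model, opt, prompt, target))
```

The `p_continuous` knob controls the mix of the two attack families, and the `-100` label mask ensures only the safe completion is supervised.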
GASP is a novel black-box attack framework that efficiently explores the latent space to generate human-readable adversarial suffixes, significantly improving jailbreak success rates while maintaining prompt coherence.
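GASP's actual optimizer and suffix generator are not reproduced here; purely as an illustration of the black-box loop, the sketch below runs a plain random-search hill climb over a generator's latent space, where `suffix_generator` (latent vector to readable suffix) and `query_target` (black-box LLM call) are hypothetical stand-ins, as is the toy refusal-based scorer.

```python
import numpy as np


def jailbreak_score(response: str) -> float:
    """Toy scorer: reward responses that don't open with a refusal (a real
    system would use a judge model instead of string matching)."""
    return 0.0 if response.startswith(("I'm sorry", "I cannot", "I can't")) else 1.0


def search_suffix(prompt, suffix_generator, query_target,
                  latent_dim=64, n_queries=200, step=0.3, seed=0):
    """Random-search hill climb over the suffix generator's latent space."""
    rng = np.random.default_rng(seed)
    best_z = rng.normal(size=latent_dim)
    best_suffix = suffix_generator(best_z)
    best_score = jailbreak_score(query_target(prompt + " " + best_suffix))
    for _ in range(n_queries):
        z = best_z + step * rng.normal(size=latent_dim)  # local latent move
        suffix = suffix_generator(z)                     # decode to readable text
        score = jailbreak_score(query_target(prompt + " " + suffix))
        if score > best_score:                           # keep the best candidate
            best_z, best_suffix, best_score = z, suffix, score
    return best_suffix, best_score


# Toy stand-ins: a real run would decode suffixes with a trained generator LM
# and send prompts to the target model's API.
toy_generator = lambda z: " ".join(f"tok{int(abs(v) * 10) % 40}" for v in z[:6])
toy_target = lambda p: "Sure, here is ..." if "tok7" in p else "I'm sorry, I can't."
print(search_suffix("Tell me how to ...", toy_generator, toy_target))
```

Because the suffix is always decoded from the generator's latent space rather than optimized token by token, candidates stay human-readable, which is the property the framework exploits.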
We find that efficient "short-length" LLM adversarial training effectively defends against "long-length" jailbreak attacks, and we support this with both theoretical and empirical evidence.
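As a rough illustration of the protocol only (not the paper's theory or experiments), the sketch below reuses the hypothetical `train_step` and `discrete_attack` helpers from the first sketch: train against short suffixes, then check that the refusal target stays likely under much longer attack suffixes.

```python
# Hypothetical protocol sketch, reusing train_step / discrete_attack, model,
# opt, prompt, and target from the first sketch above.
SHORT_LEN = 10               # suffix length used during training ("short-length")
LONG_LENS = [50, 100, 200]   # attack lengths probed at evaluation ("long-length")

for _ in range(100):         # cheap short-suffix adversarial training
    train_step(model, opt, prompt, target, suffix_len=SHORT_LEN)

model.eval()
for n in LONG_LENS:          # lower loss = the refusal survives the longer attack
    ids, labels = discrete_attack(prompt, target, model.config.vocab_size, n)
    with torch.no_grad():
        loss = model(input_ids=ids, labels=labels).loss
    print(f"attack suffix_len={n}: refusal loss {loss.item():.3f}")
```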