3 papers across 2 sessions
We mix discrete and continuous adversarial attacks to adversarially train more robust LLMs. Evaluated across a range of realistic inference settings, our models are more robust than other SoTA models while matching their training cost.
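To make the idea concrete, here is a minimal sketch of what mixing the two attack families in one training loop could look like. It is not the paper's implementation: it assumes a HuggingFace-style causal LM, uses a PGD-style embedding perturbation as the continuous attack, and substitutes a random suffix for a real discrete attack (which would be optimized, e.g., GCG-style).

```python
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def continuous_attack(model, embeds, labels, eps=0.05, steps=5, lr=0.01):
    """PGD-style perturbation of the input embeddings (the continuous attack)."""
    embeds = embeds.detach()  # perturb a fixed embedding; don't reuse its graph
    delta = torch.zeros_like(embeds, requires_grad=True)
    for _ in range(steps):
        loss = model(inputs_embeds=embeds + delta, labels=labels).loss
        loss.backward()
        with torch.no_grad():
            delta += lr * delta.grad.sign()  # ascend the loss
            delta.clamp_(-eps, eps)          # project back into the eps-ball
            delta.grad.zero_()
    return delta.detach()


def discrete_attack(prompt_ids, target_ids, vocab_size, suffix_len=10):
    """Stand-in discrete attack: insert a *random* token suffix between prompt
    and target; a real attack would optimize the suffix (e.g., GCG-style)."""
    suffix = torch.randint(vocab_size, (prompt_ids.size(0), suffix_len),
                           device=prompt_ids.device)
    input_ids = torch.cat([prompt_ids, suffix, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.size(1) + suffix_len] = -100  # supervise target only
    return input_ids, labels


def train_step(model, optimizer, prompt_ids, target_ids,
               p_continuous=0.5, suffix_len=10):
    """One adversarial-training step: attack, then fit the safe completion."""
    if random.random() < p_continuous:
        ids = torch.cat([prompt_ids, target_ids], dim=1)
        labels = ids.clone()
        labels[:, : prompt_ids.size(1)] = -100
        embeds = model.get_input_embeddings()(ids)
        delta = continuous_attack(model, embeds, labels)
        loss = model(inputs_embeds=embeds + delta, labels=labels).loss
    else:
        ids, labels = discrete_attack(prompt_ids, target_ids,
                                      model.config.vocab_size, suffix_len)
        loss = model(input_ids=ids, labels=labels).loss
    optimizer.zero_grad()  # also clears gradients accumulated by the attack
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage: reinforce a refusal under attack.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
prompt = tok("How do I pick a lock?", return_tensors="pt").input_ids
target = tok(" I can't help with that.", return_tensors="pt").input_ids
print(train_step(model, opt, prompt, target))
```

The `p_continuous` knob controls the mix of the two attack families, and the `-100` label mask ensures only the safe completion is supervised.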
GASP is a novel black-box attack framework that efficiently explores the latent space to generate human-readable adversarial suffixes, significantly improving jailbreak success rates while maintaining prompt coherence.
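GASP's actual optimizer and suffix generator are not reproduced here; purely as an illustration of the black-box loop, the sketch below runs a plain random-search hill climb over a generator's latent space, where `suffix_generator` (latent vector to readable suffix) and `query_target` (black-box LLM call) are hypothetical stand-ins, as is the toy refusal-based scorer.

```python
import numpy as np


def jailbreak_score(response: str) -> float:
    """Toy scorer: reward responses that don't open with a refusal (a real
    system would use a judge model instead of string matching)."""
    return 0.0 if response.startswith(("I'm sorry", "I cannot", "I can't")) else 1.0


def search_suffix(prompt, suffix_generator, query_target,
                  latent_dim=64, n_queries=200, step=0.3, seed=0):
    """Random-search hill climb over the suffix generator's latent space."""
    rng = np.random.default_rng(seed)
    best_z = rng.normal(size=latent_dim)
    best_suffix = suffix_generator(best_z)
    best_score = jailbreak_score(query_target(prompt + " " + best_suffix))
    for _ in range(n_queries):
        z = best_z + step * rng.normal(size=latent_dim)  # local latent move
        suffix = suffix_generator(z)                     # decode to readable text
        score = jailbreak_score(query_target(prompt + " " + suffix))
        if score > best_score:                           # keep the best candidate
            best_z, best_suffix, best_score = z, suffix, score
    return best_suffix, best_score


# Toy stand-ins: a real run would decode suffixes with a trained generator LM
# and send prompts to the target model's API.
toy_generator = lambda z: " ".join(f"tok{int(abs(v) * 10) % 40}" for v in z[:6])
toy_target = lambda p: "Sure, here is ..." if "tok7" in p else "I'm sorry, I can't."
print(search_suffix("Tell me how to ...", toy_generator, toy_target))
```

Because the suffix is always decoded from the generator's latent space rather than optimized token by token, candidates stay human-readable, which is the property the framework exploits.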
We find that efficient "short-length" LLM adversarial training effectively defends against "long-length" jailbreak attacks, and we support this with both theoretical and empirical evidence.
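As a rough illustration of the protocol only (not the paper's theory or experiments), the sketch below reuses the hypothetical `train_step` and `discrete_attack` helpers from the first sketch: train against short suffixes, then check that the refusal target stays likely under much longer attack suffixes.

```python
# Hypothetical protocol sketch, reusing train_step / discrete_attack, model,
# opt, prompt, and target from the first sketch above.
SHORT_LEN = 10               # suffix length used during training ("short-length")
LONG_LENS = [50, 100, 200]   # attack lengths probed at evaluation ("long-length")

for _ in range(100):         # cheap short-suffix adversarial training
    train_step(model, opt, prompt, target, suffix_len=SHORT_LEN)

model.eval()
for n in LONG_LENS:          # lower loss = the refusal survives the longer attack
    ids, labels = discrete_attack(prompt, target, model.config.vocab_size, n)
    with torch.no_grad():
        loss = model(input_ids=ids, labels=labels).loss
    print(f"attack suffix_len={n}: refusal loss {loss.item():.3f}")
```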