3 papers across 2 sessions
A new LLM jailbreak objective that enables more nuanced control over jailbroken responses, exploits undergeneralization of safety alignment, and improves success rates of existing jailbreaks from 14% to 80%.
GASP is a novel black-box attack framework that efficiently explores the latent space to generate human-readable adversarial suffixes, significantly improving jailbreak success rates while maintaining prompt coherence.
We introduce a novel framework for red-teaming black-box text-to-image (T2I) systems, termed Rule-based Preference modeling Guided Red-Teaming (RPG-RT).