2 papers across 2 sessions
Reframes black-box LLM jailbreaking as a variational inference problem, efficiently generating diverse, fluent adversarial prompts that outperform traditional genetic methods.
Exploiting LLMs' tendency to overfit, we jailbreak them by fine-tuning on only ten benign QA pairs.