2 papers across 2 sessions
A new LLM jailbreak objective that enables more nuanced control over jailbroken responses, exploits undergeneralization of safety alignment, and improves the success rate of existing jailbreaks from 14% to 80%.
We learn a generative model of the Pareto set that can be conditioned on subjective preferences without retraining, enabling online multi-objective optimization over discrete and mixed spaces.