Intern, Microsoft Research Asia
1 paper at NeurIPS 2025
We find that efficient "short-length" LLM adversarial training is effective for defending against "long-length" jailbreak attacks, a result supported by both theoretical and empirical evidence.