Toxicity - NeurIPS 2025

Toxicity

2 papers across 2 sessions

Poster Session 1

Wednesday, December 3, 2025 · 11:00 AM → 2:00 PM

Redefining Experts: Interpretable Decomposition of Language Models for Toxicity Mitigation

#1110 · Zuhair Hasan Shaik, Abdullah Mazhar, Aseem Srivastava, Md Shad Akhtar

Our method steers LLMs away from toxic words in real time, guiding generation toward safe alternatives using a singular value decomposition (SVD) of the output layer. It requires no retraining and preserves fluency and context.

Poster Session 6

1 paper

Friday, December 5, 2025 · 4:30 PM → 7:30 PM

Exhibit Hall C,D,E

LinEAS: End-to-end Learning of Activation Steering with a Distributional Loss

#3604 · Pau Rodriguez, Michal Klein, Eleonora Gualdoni, Valentino Maiorca, Arno Blaas, Luca Zappella, Marco Cuturi, Xavier Suau

We propose an inference-time intervention framework based on optimal transport that generalizes previous methods and allows interpretable control of both LLMs and diffusion models.