Steering AI

1 paper across 1 session

Poster Session 1

Wednesday, December 3, 2025 · 11:00 AM → 2:00 PM

Redefining Experts: Interpretable Decomposition of Language Models for Toxicity Mitigation

#1110 · Zuhair Hasan Shaik, Abdullah Mazhar, Aseem Srivastava, Md Shad Akhtar

Our method steers LLMs away from toxic words in real time, guiding generation toward safe alternatives using a singular value decomposition (SVD) of the output layer. It requires no retraining and preserves fluency and context.
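As a rough illustration of the idea in the summary above, the sketch below decomposes a toy output-layer matrix with SVD and removes one singular direction from a hidden state before the logits are computed. It is not the authors' implementation: the matrix `W`, the `toxic_ids` list, the scoring rule, and the `steer` helper are all hypothetical and chosen only to make the mechanics concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, hidden_dim = 50, 16

# Hypothetical toy output-layer (unembedding) matrix: logits = W @ h
W = rng.normal(size=(vocab_size, hidden_dim))

# Hypothetical "toxic" token ids, picked arbitrarily for illustration
toxic_ids = np.array([3, 7, 11])

# SVD of the output layer: W = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# Assumed scoring rule: how much weight each singular direction
# places on the toxic tokens in vocabulary space (columns of U)
toxic_score = (U[toxic_ids] ** 2).sum(axis=0)
k = int(np.argmax(toxic_score))

# Matching unit-norm direction in hidden space (row k of Vt)
direction = Vt[k]


def steer(h, direction, strength=1.0):
    """Subtract a fraction of h's component along a unit direction."""
    return h - strength * (h @ direction) * direction


h = rng.normal(size=hidden_dim)       # stand-in for a final hidden state
h_steered = steer(h, direction)

logits_before = W @ h
logits_after = W @ h_steered
```

With `strength=1.0` the steered state has no component along the chosen direction, so the logits change only by that direction's contribution, `S[k] * (h @ direction) * U[:, k]`. Whether this lowers the logits of particular tokens depends on the sign and size of that contribution, so a real system would need a principled way to select and scale directions. No weights are updated, which is consistent with the claim that no retraining is needed.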