2 papers across 2 sessions
Our method steers LLMs away from toxic words in real time, guiding generation toward safe alternatives using the output layer’s SVD decomposition. No retraining is needed, while fluency and context are preserved.
We propose an inference-time intervention framework based on Optimal Transport that generalizes previous methods and allows interpretable control of both LLMs and Diffusion models.