1 paper across 1 session
We use influence functions to attribute toxic behaviors in LLMs to the training examples that promote them, and to suppress those examples.
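
To make the idea of influence-based attribution concrete, here is a minimal sketch on a toy problem. It is not the paper's method: it assumes a two-parameter L2-regularized logistic regression in place of an LLM, an exact 2x2 Hessian in place of any large-scale approximation, and the classical score `-grad(query)^T H^{-1} grad(train_i)`. All names (`fit`, `influence_scores`, `solve2`) and the data are hypothetical and chosen for illustration only.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def grad_loss(theta, x, y):
    """Gradient of the logistic loss for one example (label y in {0, 1})."""
    p = sigmoid(theta[0] * x[0] + theta[1] * x[1])
    return [(p - y) * x[0], (p - y) * x[1]]

def hessian(theta, xs, lam):
    """Hessian of the mean logistic loss plus the L2 term (lam / 2) * ||theta||^2."""
    h = [[lam, 0.0], [0.0, lam]]
    for x in xs:
        p = sigmoid(theta[0] * x[0] + theta[1] * x[1])
        w = p * (1.0 - p) / len(xs)
        for i in range(2):
            for j in range(2):
                h[i][j] += w * x[i] * x[j]
    return h

def solve2(h, g):
    """Solve the 2x2 linear system h @ v = g in closed form."""
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    return [(h[1][1] * g[0] - h[0][1] * g[1]) / det,
            (h[0][0] * g[1] - h[1][0] * g[0]) / det]

def fit(xs, ys, lam, steps=50):
    """Fit the toy model with Newton's method on the regularized mean loss."""
    theta = [0.0, 0.0]
    for _ in range(steps):
        g = [lam * theta[0], lam * theta[1]]
        for x, y in zip(xs, ys):
            gi = grad_loss(theta, x, y)
            g[0] += gi[0] / len(xs)
            g[1] += gi[1] / len(xs)
        step = solve2(hessian(theta, xs, lam), g)
        theta = [theta[0] - step[0], theta[1] - step[1]]
    return theta

def influence_scores(theta, xs, ys, x_q, y_q, lam):
    """Score each training example by -grad(query)^T H^{-1} grad(train_i).

    A positive score predicts that upweighting the example raises the query
    loss; a negative score predicts that it lowers the query loss.
    """
    v = solve2(hessian(theta, xs, lam), grad_loss(theta, x_q, y_q))
    return [-(v[0] * g[0] + v[1] * g[1])
            for g in (grad_loss(theta, x, y) for x, y in zip(xs, ys))]

# Hypothetical toy data; the last example contradicts the first two.
xs = [(1.0, 0.0), (0.9, 0.2), (0.0, 1.0), (0.1, 0.8), (1.0, 0.1)]
ys = [1, 1, 0, 0, 0]
lam = 0.1

theta = fit(xs, ys, lam)
# Query: the behavior whose training-data sources we want to trace.
scores = influence_scores(theta, xs, ys, (1.0, 0.0), 1, lam)
ranked = sorted(range(len(xs)), key=lambda i: scores[i])  # most negative first
```

Under this sign convention, the examples at the front of `ranked` are the ones predicted to most reduce the loss on the query, that is, to most promote the queried behavior. In a toxicity setting the query would presumably be a toxic output, and those examples would be the candidates to suppress; how suppression is actually carried out is not specified here.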