Researcher, Anthropic
1 paper at NeurIPS 2025
We use influence functions to identify and suppress the training examples that promote toxic behaviors in LLMs.