Full Professor, Carnegie Mellon University
7 papers at NeurIPS 2025
We introduce a benchmark for measuring the safety of general computer-use agents across diverse categories of harm.
Antidistillation sampling strategically modifies a model's next-token probability distribution to poison reasoning traces, rendering them significantly less effective for distillation while preserving the model's practical utility.
We present a data-centric pretraining framework that builds safety into the model from the start.
We reliably predict the behavior of black-box language models by training predictors on their responses to follow-up questions.
Open-source framework for LLM unlearning supporting multiple benchmarks and methods.