Postdoc, Carnegie Mellon University
2 papers at NeurIPS 2025
Antidistillation sampling strategically modifies a model's next-token probability distribution to poison reasoning traces, rendering them significantly less effective for distillation while preserving the model's practical utility.
We reliably predict the behavior of black-box language models by training predictors on their responses to follow-up questions.