PhD student, Harvard University
2 papers at NeurIPS 2025
We introduce CARES, an 18K-prompt benchmark for evaluating the medical safety of LLMs under adversarial conditions, covering graded harm levels, jailbreak attacks, and a fine-grained response metric.
We show that explicit chain-of-thought reasoning can hurt instruction following in LLMs by reducing constraint adherence, and propose four mitigation methods that recover or improve performance.