Assistant Professor, Department of Computer Science, University of Washington
3 papers at NeurIPS 2025
A framework and benchmark to evaluate language models' reasoning on imperfect tabular data
We introduce a framework for evaluating and improving LLM consistency in simulated human dialogue. Our metrics correlate with human judgments and, when used with multi-turn RL, reduce inconsistency across chit-chat, teaching, and mental health dialogue.