Assistant Professor, Northeastern University
2 papers at NeurIPS 2025
This work stress-tests factuality metrics, showing that they mostly perform well on the same cases that shallow models handle well, struggle with complex examples, fail to reward factual corrections, and are easily gamed, which raises concerns about their reliability.