Senior Researcher, Microsoft
3 papers at NeurIPS 2025
We propose an uncertainty-based routing framework that efficiently complements a fast reward model (RM) with a strong but costly LLM judge.
We present the first systematic study of lossy latency–quality trade-offs in LLM agents, introduce the HFTBench and StreetFighter benchmarks, and propose an adaptive mixed-precision framework for real-world latency-sensitive tasks.
We propose Think-RM, a training framework for generative reward models that enables long-horizon reasoning, and introduce a pairwise RLHF pipeline that optimizes policies directly using pairwise preference rewards.