Senior Principal Researcher, Microsoft Research
4 papers at NeurIPS 2025
By carefully coordinating off-the-shelf models at inference time only, we show that the DSP framework achieves surprisingly strong results in theorem proving, comparable to frontier models trained with large-scale RL.
We introduce rStar-Coder to train advanced code-reasoning LLMs; our 14B model achieves performance comparable to QwQ-32B.
RetrievalAttention speeds up decoding and reduces GPU memory usage in Transformer-based LLMs by using pre-built, attention-aware KV vector indexes stored in CPU memory, delivering these efficiency gains without compromising accuracy.
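To make the idea concrete, below is a minimal, hypothetical sketch (not the paper's implementation) of retrieval-based sparse attention: the KV cache lives in host (CPU) memory behind an index, and at each decoding step only the top-k entries most relevant to the current query are retrieved and attended to. The class and function names are illustrative, and the exhaustive scan stands in for the attention-aware approximate index described in the paper.

# Illustrative sketch only; names (CPUKVIndex, retrieval_attention) are assumptions.
import numpy as np

class CPUKVIndex:
    """Holds the KV cache in host memory and retrieves the top-k entries
    most relevant to a query, so only a small subset is attended to per step."""

    def __init__(self, keys: np.ndarray, values: np.ndarray):
        self.keys = keys      # shape: (num_tokens, head_dim), kept off-GPU
        self.values = values  # shape: (num_tokens, head_dim)

    def retrieve(self, query: np.ndarray, k: int):
        # Score every cached key against the query. A real system would use an
        # approximate, attention-aware index here instead of an exhaustive scan.
        scores = self.keys @ query
        topk = np.argpartition(scores, -k)[-k:]
        return self.keys[topk], self.values[topk]

def retrieval_attention(query: np.ndarray, index: CPUKVIndex, k: int = 64):
    """Attention computed over only the retrieved KV subset, not the full cache."""
    keys, values = index.retrieve(query, k)
    logits = keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values  # approximate attention output

# Toy usage: 10k cached tokens, 128-dim head, attend to just 64 retrieved entries.
rng = np.random.default_rng(0)
cache = CPUKVIndex(rng.standard_normal((10_000, 128)).astype(np.float32),
                   rng.standard_normal((10_000, 128)).astype(np.float32))
out = retrieval_attention(rng.standard_normal(128).astype(np.float32), cache)
print(out.shape)  # (128,)

Because only k of the cached vectors are touched per step, GPU memory holds just the retrieved subset while the full cache stays in cheaper CPU memory, which is the source of the efficiency gains described above.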