Principal Researcher, Microsoft
1 paper at NeurIPS 2025
RetrievalAttention speeds up decoding and reduces GPU memory usage in Transformer-based LLMs by offloading KV vectors to CPU memory and, at each decoding step, retrieving only the most relevant ones through pre-built, attention-aware vector indexes, achieving significant efficiency gains without compromising accuracy.
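The core idea can be illustrated with a minimal sketch: instead of attending over the full KV cache, score the cached keys, keep only the top-k most relevant, and compute attention over that subset. This is an illustrative simplification, not the paper's implementation; the `retrieval_attention` helper and brute-force scoring are stand-ins for the attention-aware ANN index described in the summary.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def retrieval_attention(q, keys, values, k=32):
    """Approximate attention for one query: score all cached keys,
    keep the top-k, and attend only over that subset. A real system
    would replace the brute-force scan with a pre-built ANN index
    held in CPU memory."""
    scores = keys @ q / np.sqrt(q.shape[0])    # similarity to every cached key
    topk = np.argpartition(scores, -k)[-k:]    # indices of the k best keys
    w = softmax(scores[topk])                  # attention weights over subset
    return w @ values[topk]                    # weighted sum of selected values

# Toy KV cache: 1024 cached tokens, head dimension 64
rng = np.random.default_rng(0)
n, d = 1024, 64
keys = rng.normal(size=(n, d))
values = rng.normal(size=(n, d))
q = rng.normal(size=d)

out = retrieval_attention(q, keys, values, k=32)
print(out.shape)
```

Because only k of the n cached vectors participate in each step, the per-token cost drops from O(n) to roughly O(k) once the index amortizes the search, which is where the decoding speedup comes from.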