Researcher, Microsoft Research
3 papers at NeurIPS 2025
A hybrid architecture with linear pre-filling complexity and up to 10x higher decoding throughput.
We propose SAS to simulate a larger number of attention heads and a larger hidden size per head for better performance, while keeping the original model size.
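I don't have the paper's implementation at hand, so the following is only a minimal sketch of the general idea under my own assumptions: small learned mixing matrices expand the h real heads into h' > h simulated heads before the attention-score computation and fold the scores back afterwards, so the added cost is a pair of tiny head-mixing parameters rather than new full-size heads. `SimulatedHeadAttention`, `n_sim_heads`, and the `expand`/`reduce` parameters are illustrative names, not the paper's API; a larger per-head hidden size could be simulated analogously with a mixing matrix along the per-head channel axis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimulatedHeadAttention(nn.Module):
    """Attention whose scores are computed with h' > h simulated heads,
    then folded back to h real heads, leaving parameters nearly unchanged."""

    def __init__(self, d_model: int, n_heads: int, n_sim_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.n_sim = n_heads, n_sim_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # Tiny mixing matrices along the head axis: h -> h' and h' -> h.
        self.expand = nn.Parameter(torch.randn(n_sim_heads, n_heads) / n_heads**0.5)
        self.reduce = nn.Parameter(torch.randn(n_heads, n_sim_heads) / n_sim_heads**0.5)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))                        # each (B, h, T, d)
        q_s = torch.einsum('sh,bhtd->bstd', self.expand, q)   # (B, h', T, d)
        k_s = torch.einsum('sh,bhtd->bstd', self.expand, k)
        scores = q_s @ k_s.transpose(-2, -1) / self.d_head**0.5   # (B, h', T, T)
        # Aggregate simulated-head scores back down to the real heads.
        scores = torch.einsum('hs,bsij->bhij', self.reduce, scores)
        attn = F.softmax(scores, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, T, -1)      # (B, T, d_model)
        return self.out(y)
```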
We show that a single training example suffices for RLVR on LLMs to achieve significant improvements on math tasks.
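The setup is simple enough to sketch. Below is a minimal, hypothetical illustration of RLVR on one example, assuming a binary verifiable reward (the extracted answer either matches the gold answer or it doesn't) driving a REINFORCE-style update with a group-normalized baseline in the spirit of GRPO. `policy.sample`, `rlvr_step`, and the regex-based answer extraction are my own stand-ins, not the paper's code.

```python
import re
import torch

def verifiable_reward(completion: str, gold: str) -> float:
    """Binary verifiable reward: 1.0 iff the last number in the
    completion matches the gold answer, else 0.0."""
    nums = re.findall(r'-?\d+(?:\.\d+)?', completion)
    return 1.0 if nums and nums[-1] == gold else 0.0

def rlvr_step(policy, optimizer, prompt: str, gold: str, group_size: int = 8):
    # `policy.sample` is a hypothetical API returning `group_size`
    # completions and a scalar log-probability tensor for each one.
    completions, logps = policy.sample(prompt, n=group_size)
    rewards = torch.tensor([verifiable_reward(c, gold) for c in completions])
    # Group-normalized advantages: baseline is the mean reward of the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    loss = -(adv * torch.stack(logps)).mean()  # REINFORCE-style objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage: the same single (prompt, gold) pair is reused at every step.
# for _ in range(num_steps):
#     rlvr_step(policy, optimizer, prompt=THE_ONE_EXAMPLE, gold="42")
```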