PhD student, Stanford University
3 papers at NeurIPS 2025
A hybrid architecture with linear pre-filling complexity and up to 10x higher decoding throughput.
We propose SAS to simulate a larger number of attention heads and a larger hidden size per head for better performance, while keeping the original model size.