Suozhi Huang

PhD student, Princeton University

1 paper at NeurIPS 2025

Homepage· OpenReview· Semantic Scholar· Google Scholar

Poster Session 4

Thursday, December 4, 2025 · 4:30 PM → 7:30 PM

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

#5202 · Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin

We find applying a query-dependent head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA) consistently improves performance, improves scaling properties and mitigates the `massive activation' and `attention sink'.