Researcher, Alibaba Group
4 papers at NeurIPS 2025
We find that applying a query-dependent, head-specific sigmoid gate after Scaled Dot-Product Attention (SDPA) consistently improves performance and scaling properties, and mitigates the 'massive activation' and 'attention sink' phenomena.
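A minimal PyTorch sketch of the idea, assuming a single linear gate projection per layer (reshaped into per-head gates); module names, shapes, and the exact gate parameterization are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Multi-head attention with a query-dependent, head-specific sigmoid
    gate applied elementwise to the SDPA output (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # Gate projection: maps each token's input state to one gate value
        # per output dimension; reshaped below into per-head gates.
        self.gate_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, n_heads, T, d_head) for SDPA.
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Query-dependent gate in (0, 1), computed from the same token's
        # input representation and applied AFTER SDPA, per head.
        gate = torch.sigmoid(self.gate_proj(x))  # (B, T, D)
        gate = gate.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        attn = attn * gate
        attn = attn.transpose(1, 2).reshape(B, T, D)
        return self.out_proj(attn)
```

Because the gate can drive a head's contribution toward zero for specific queries, it gives each head a learned, input-dependent "off switch", which is one intuition for why it can suppress attention-sink behavior.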
A scalable system for foundation-model data processing, offering 150+ multimodal operators (OPs), cloud-native efficiency (TB-scale data on 10k+ cores), and diverse interfaces (Python, APIs, chat); widely adopted in research and industry (e.g., Alibaba Cloud).
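To make the "composable OPs" idea concrete, here is a hypothetical sketch of how filter- and mapper-style operators chain over samples; all class and function names are invented for illustration and are not the system's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    text: str
    meta: dict = field(default_factory=dict)

class WhitespaceNormalizer:
    """Mapper-style OP: collapse repeated whitespace in the text field."""
    def __call__(self, s: Sample) -> Sample:
        s.text = " ".join(s.text.split())
        return s

class LengthFilter:
    """Filter-style OP: drop samples whose text length is out of range."""
    def __init__(self, min_len: int = 10, max_len: int = 10_000):
        self.min_len, self.max_len = min_len, max_len
    def __call__(self, s: Sample):
        return s if self.min_len <= len(s.text) <= self.max_len else None

def run_pipeline(samples, ops):
    """Apply each OP in order; an OP returning None drops the sample."""
    for s in samples:
        for op in ops:
            s = op(s)
            if s is None:
                break
        else:
            yield s

if __name__ == "__main__":
    data = [Sample("hello   world"), Sample("too short")]
    cleaned = run_pipeline(data, [WhitespaceNormalizer(), LengthFilter(min_len=10)])
    print([s.text for s in cleaned])  # ['hello world']
```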