Principal Researcher, Alibaba Group
6 papers at NeurIPS 2025
This paper introduces CoRT, a post-training framework for teaching large reasoning models (LRMs) to use a Code Interpreter (CI) effectively and efficiently.
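To make "leverage a Code Interpreter" concrete, here is a minimal sketch of the generate–execute–resume loop that code-interpreter-augmented reasoning assumes. Everything here is illustrative: `generate` stands in for an LRM call, and the ```python fencing convention and `run_python` helper are hypothetical placeholders, not CoRT's actual interface or training recipe.

```python
import re
import subprocess
import sys

CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def run_python(code: str, timeout: int = 10) -> str:
    """Execute a model-emitted snippet in a subprocess; capture its output."""
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, timeout=timeout)
    return proc.stdout + proc.stderr

def reason_with_interpreter(generate, prompt: str, max_rounds: int = 4) -> str:
    """Alternate model generation with code execution: whenever the model
    emits a python block, run it and append the interpreter output so the
    next reasoning step conditions on real execution results."""
    transcript = prompt
    for _ in range(max_rounds):
        completion = generate(transcript)  # placeholder for an LRM call
        transcript += completion
        match = CODE_BLOCK.search(completion)
        if match is None:                  # no code requested: reasoning is done
            return transcript
        transcript += f"\n[interpreter output]\n{run_python(match.group(1))}\n"
    return transcript
```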
We find that applying a query-dependent, head-specific sigmoid gate after Scaled Dot-Product Attention (SDPA) consistently improves performance, improves scaling properties, and mitigates the "massive activation" and "attention sink" phenomena.
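A minimal PyTorch sketch of the gating described above, under stated assumptions: module names, dimensions, and the choice of an elementwise per-head gate are illustrative, not the paper's exact implementation. The gate is computed from the same query-token hidden state that produces Q (query-dependent), with separate parameters per head dimension (head-specific), and multiplies the SDPA output before the output projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Multi-head attention with a query-dependent, head-specific sigmoid
    gate applied to the SDPA output (a sketch; dims are illustrative)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Gate projection: maps each query token's hidden state to one
        # gate value per head channel (hypothetical granularity).
        self.gate_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        split = lambda t: t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        # Standard scaled dot-product attention (causal for decoding).
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Sigmoid gate computed from the query input x, applied elementwise
        # to each head's SDPA output before the output projection.
        gate = torch.sigmoid(split(self.gate_proj(x)))
        y = y * gate
        y = y.transpose(1, 2).contiguous().view(B, T, -1)
        return self.o_proj(y)

# Usage: GatedAttention()(torch.randn(2, 16, 512)) -> tensor of shape (2, 16, 512)
```

Because the sigmoid can approach zero, a head can effectively opt out of contributing for a given query rather than being forced to dump attention mass somewhere, which is one intuition for why such gating would reduce attention-sink behavior.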