PhD student, Massachusetts Institute of Technology
3 papers at NeurIPS 2025
We propose a contextualized position encoding using dynamic Householder matrices in place of static rotary ones, along with a hardware-efficient training algorithm that improves state tracking performance.
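A static rotary embedding applies the same position-indexed rotation regardless of content; the contextual variant described here instead builds an orthogonal Householder transform from the token itself. A minimal NumPy sketch of the idea (the projection `W` and the use of a single reflection per token are illustrative assumptions, not the paper's actual parameterization):

```python
import numpy as np

def householder(v):
    # H = I - 2 v v^T / ||v||^2: an orthogonal reflection determined by v
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d)) / np.sqrt(d)  # hypothetical learned projection
x = rng.standard_normal(d)                    # a token's hidden state
v = W @ x                                     # context-dependent reflection vector
H = householder(v)                            # dynamic, token-conditioned transform
q_transformed = H @ x                         # applied like a rotary rotation

# like a rotary matrix, H is orthogonal, so it preserves norms and inner products
assert np.allclose(H @ H.T, np.eye(d))
```

Because each Householder matrix is orthogonal, it keeps the norm-preserving property that makes rotary encodings well behaved, while letting the transform vary with context rather than with position alone.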
A sparse attention mechanism with $\mathcal O(n \log n)$ complexity for long video generation.
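One common way such subquadratic complexity arises is to let each query attend to $\mathcal O(\log n)$ keys. The sketch below uses a log-strided pattern purely for illustration; the actual sparsity pattern in the work above is not specified here and may differ:

```python
import numpy as np

def log_sparse_mask(n):
    # each query i attends to itself and to keys at power-of-two offsets
    # behind it, giving O(log n) keys per query and O(n log n) pairs total
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, i] = True
        k = 1
        while i - k >= 0:
            mask[i, i - k] = True
            k *= 2
    return mask

n = 1024
mask = log_sparse_mask(n)
# attended pairs grow as n log n, far below the dense n**2 = 1_048_576
print(mask.sum())
```

Applying this mask before the softmax means only the marked query-key pairs need their scores computed, which is where the complexity saving comes from.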
We find that applying a query-dependent, head-specific sigmoid gate after scaled dot-product attention (SDPA) consistently improves performance and scaling properties, and mitigates the `massive activation' and `attention sink' phenomena.
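The gate multiplies each head's SDPA output by a per-query scalar in $(0, 1)$ computed from the query itself. A minimal single-head NumPy sketch (the gate parameterization `sigmoid(q @ w_gate)` is an assumed simple form, not necessarily the exact one used):

```python
import numpy as np

def sdpa(q, k, v):
    # standard scaled dot-product attention for one head
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def gated_sdpa(q, k, v, w_gate):
    # query-dependent gate: a sigmoid of a projection of each query,
    # with w_gate being a separate (head-specific) parameter vector
    out = sdpa(q, k, v)
    gate = 1.0 / (1.0 + np.exp(-(q @ w_gate)))  # one scalar in (0, 1) per query
    return gate[:, None] * out

rng = np.random.default_rng(0)
n, d = 6, 4
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
w_gate = rng.standard_normal(d)  # hypothetical per-head gate weights
out = gated_sdpa(q, k, v, w_gate)
```

Since the gate can scale a head's contribution toward zero for a given query, the model no longer needs to dump probability mass on a sink token to suppress a head, which is one intuition for why such gating helps.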