PhD student, Stanford University
2 papers at NeurIPS 2025
We propose a contextualized position encoding using dynamic Householder matrices in place of static rotary ones, along with a hardware-efficient training algorithm that improves state tracking performance.
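The core idea can be sketched in a few lines of numpy. This is an illustrative toy, not the paper's implementation: the reflection vectors here are drawn from random "hidden states" to stand in for a learned, input-dependent projection, and the cumulative product of Householder reflections plays the role that fixed rotation matrices play in RoPE.

```python
import numpy as np

def householder(v):
    """Reflection H = I - 2 v v^T / (v^T v); orthogonal by construction."""
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

rng = np.random.default_rng(0)
d = 8
# Hypothetical stand-in: each position derives its reflection vector from
# its own hidden state (e.g. via a learned projection), unlike RoPE's
# position-only, data-independent rotations.
hidden = rng.standard_normal((4, d))          # 4 positions
Hs = [householder(h) for h in hidden]

# The transform between two positions is the cumulative product of the
# per-position reflections, applied to keys before the query-key dot product.
P = np.linalg.multi_dot(Hs)
k = rng.standard_normal(d)
k_enc = P @ k
```

Because each factor is orthogonal, the cumulative product preserves norms, so the encoded keys keep the scale-stability property of rotary embeddings while remaining content-dependent.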
We find that applying a query-dependent, head-specific sigmoid gate after Scaled Dot-Product Attention (SDPA) consistently improves performance and scaling properties, and mitigates the "massive activation" and "attention sink" phenomena.
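A minimal single-head numpy sketch of the gating mechanism, under my own assumptions about the shapes: the gate is a sigmoid of a linear projection of the query (`Wg`, `bg` are hypothetical parameter names), applied elementwise to the SDPA output.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(q, k, v, Wg, bg):
    """SDPA followed by an elementwise sigmoid gate computed from the query.

    q, k, v: (T, d) for one head; Wg (d, d), bg (d,) are the gate's
    (assumed) per-head parameters.
    """
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d)) @ v       # standard SDPA
    gate = 1.0 / (1.0 + np.exp(-(q @ Wg + bg)))    # query-dependent sigmoid
    return gate * attn                             # elementwise modulation

rng = np.random.default_rng(0)
T, d = 5, 8
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
Wg, bg = rng.standard_normal((d, d)), np.zeros(d)
out = gated_attention(q, k, v, Wg, bg)
```

Since the gate lies in (0, 1), it can only attenuate the attention output, which gives each head a query-conditioned way to suppress its contribution rather than dumping probability mass onto a sink token.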