PhD student, Johns Hopkins University
1 paper at NeurIPS 2025
We propose a scale-invariant attention mechanism for transformers and show that it improves performance in out-of-distribution long-context settings.
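The summary above does not say how scale invariance is achieved, so the following is an illustration only: one well-known recipe for length-robust attention is to temperature-scale the logits by the log of the context length, which keeps the attention entropy roughly stable as the context grows past the training length. The sketch below assumes that recipe; the function name `scale_invariant_attention` and the `train_len` parameter are hypothetical, and the paper's actual mechanism may differ.

```python
import math
import torch
import torch.nn.functional as F

def scale_invariant_attention(q, k, v, train_len=128):
    """Dot-product attention with a log-length temperature (illustrative sketch).

    Multiplying the logits by log(n) / log(train_len) keeps attention
    entropy roughly constant as the context length n grows beyond the
    training length -- one common recipe for long-context extrapolation.
    This may differ from the paper's mechanism.

    q, k, v: tensors of shape (batch, heads, n, d).
    """
    n, d = q.shape[-2], q.shape[-1]
    # Standard 1/sqrt(d) scaling, plus a length-dependent temperature
    # that sharpens the logits once n exceeds the training context length.
    temp = math.log(n) / math.log(train_len)
    logits = (q @ k.transpose(-2, -1)) * temp / math.sqrt(d)
    weights = F.softmax(logits, dim=-1)
    return weights @ v

# Example: run on a context 4x longer than the assumed training length.
q = torch.randn(1, 4, 512, 32)
k = torch.randn(1, 4, 512, 32)
v = torch.randn(1, 4, 512, 32)
out = scale_invariant_attention(q, k, v, train_len=128)
print(out.shape)  # torch.Size([1, 4, 512, 32])
```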