Poster Session 5 · Friday, December 5, 2025 11:00 AM → 2:00 PM
#3510
Neural Attention Search
Abstract
We present Neural Attention Search (NAtS), an end-to-end learnable sparse transformer that automatically evaluates the importance of each token in a sequence and determines whether that token can be dropped after a certain number of steps.
To this end, we design a search space that contains three token types:
- Global Tokens are preserved and can be queried by all subsequent tokens.
- Local Tokens survive only until the next global token appears.
- Sliding Window Tokens influence only a fixed-size window of subsequent tokens.
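The three token types above jointly induce a binary attention mask over the sequence. The following is a minimal sketch of that mask logic; the type names, the `window` parameter, and the mask-building routine are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Illustrative token-type codes (not the paper's API).
GLOBAL, LOCAL, SLIDING = 0, 1, 2

def attention_mask(token_types, window=2):
    """Build a causal attention mask from per-token types.

    mask[q, k] == True means query position q may attend to key position k.
      GLOBAL  - visible to every later query
      LOCAL   - visible only until the next GLOBAL token appears
      SLIDING - visible only to the next `window` queries
    """
    n = len(token_types)
    mask = np.zeros((n, n), dtype=bool)
    for q in range(n):
        for k in range(q + 1):  # causal: only keys at or before the query
            t = token_types[k]
            if t == GLOBAL:
                visible = True
            elif t == LOCAL:
                # dropped once a GLOBAL token appears strictly after k, at or before q
                visible = not any(token_types[j] == GLOBAL
                                  for j in range(k + 1, q + 1))
            else:  # SLIDING
                visible = (q - k) <= window
            mask[q, k] = visible
    return mask
```

A key k whose mask column is all-False for every remaining query can be evicted from the KV cache, which is the source of the memory savings claimed below.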
Similar to the one-shot Neural Architecture Search approach, this token-type information can be learned jointly with the model weights via a learnable attention mask. Experiments on both training new transformers from scratch and fine-tuning existing large language models show that NAtS efficiently reduces the KV cache size and inference cost of transformer-based models while maintaining their performance.