PhD student, Institute of Science and Technology Austria
1 paper at NeurIPS 2025
We characterize the structure of embeddings obtained via gradient descent, showing that the attention mechanism provably selects important tokens.