We characterize the structure of embeddings obtained via gradient descent, showing that the attention mechanism provably selects important tokens.