Poster Session 6 · Friday, December 5, 2025, 4:30 PM – 7:30 PM
#3909

Attention with Trained Embeddings Provably Selects Important Tokens

NeurIPS Slides OpenReview

Abstract

Token embeddings play a crucial role in language modeling but, despite this practical relevance, their theoretical understanding is limited.
Our paper addresses this gap by characterizing the structure of embeddings obtained via gradient descent. Specifically, we consider a one-layer softmax attention model with a linear head for binary classification, i.e., $\mathrm{Softmax}(p^\top E_X^\top)\, E_X\, v$, where $E_X = [e_{x_1}, \ldots, e_{x_T}]^\top$ contains the embeddings of the input sequence, $p$ is the embedding of the $\langle \mathrm{cls} \rangle$ token, and $v$ is the output vector. First, we show that, already after a single step of gradient training with the standard logistic loss, the embeddings capture the importance of tokens in the dataset: they align with the output vector $v$ proportionally to the corresponding average signed frequency, a quantity that measures how predictive each token is of the label.
Then, after training via gradient flow until convergence, the softmax selects the important tokens in the sentence (i.e., those that are predictive of the label), and the resulting embedding maximizes the margin for such a selection. Experiments on real-world datasets (IMDB, Yelp) exhibit a phenomenology close to that unveiled by our theory.