Postdoc, INRIA
1 paper at NeurIPS 2025
On a linear bigram model, we show that fitting the heavy-tailed token distribution found in text requires a number of training iterations that scales linearly with the vocabulary size under gradient descent, but only as its square root under sign descent.
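The setting above can be sketched with a toy experiment: a linear bigram model (next-token logits are a row of a weight matrix indexed by the context token) trained by cross-entropy on Zipfian data, comparing a gradient-descent update with a sign-descent update. This is an illustrative sketch, not the paper's experimental setup; the vocabulary size, sample count, step sizes, and the square-root learning-rate decay are all assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

V, N = 50, 5000  # vocabulary size and number of bigram pairs (illustrative choices)

# Heavy-tailed (Zipfian) token distribution, as found in text
p = 1.0 / np.arange(1, V + 1)
p /= p.sum()

x = rng.choice(V, size=N, p=p)  # context tokens
y = rng.choice(V, size=N, p=p)  # next tokens

def loss_and_grad(W):
    """Mean cross-entropy of the linear bigram model with logits W[x], and its gradient."""
    logits = W[x] - W[x].max(axis=1, keepdims=True)  # stabilized softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), y]).mean()
    d = probs
    d[np.arange(N), y] -= 1.0                        # softmax - one-hot
    grad = np.zeros((V, V))
    np.add.at(grad, x, d / N)                        # accumulate per-context rows
    return loss, grad

def train(direction, steps=300, lr=0.5):
    """Run descent with update W -= lr_t * direction(grad), lr_t decaying as 1/sqrt(t)."""
    W = np.zeros((V, V))
    for t in range(steps):
        _, g = loss_and_grad(W)
        W -= lr / np.sqrt(t + 1) * direction(g)
    return loss_and_grad(W)[0]

gd_loss = train(lambda g: g)            # gradient descent
sign_loss = train(lambda g: np.sign(g))  # sign descent

print(f"GD final loss: {gd_loss:.3f}, sign-descent final loss: {sign_loss:.3f}")
```

Both runs start from the uniform-prediction loss log V; how quickly each optimizer drives the loss down as V grows is the quantity whose scaling the paper analyzes.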