1 paper across 1 session
In the early phase of training on algorithmic tasks, Transformers show a loss plateau, repetition bias, and representation collapse before a sudden drop in loss.