2 papers across 1 session
We show that learning sparse attention is prone to emerging behaviors during training, and study (theoretically and empirically) how data and model design influence emergence speed.