We show that transformers achieve length generalization when trained jointly on a shorter main task and longer auxiliary tasks.