PhD student, University of Wisconsin - Madison
1 paper at NeurIPS 2025
We show that transformers achieve length generalization when trained jointly on a shorter main task and longer auxiliary tasks.