We show that recurrent-depth transformers can be scaled into effective language models, with particularly strong gains on reasoning tasks when additional compute is spent on extra recurrence steps.
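To make the idea concrete, here is a minimal sketch of depth recurrence: a single weight-shared transformer block is applied repeatedly, so the effective depth (and thus the compute budget) can be chosen at inference time. All names (`RecurrentDepthLM`, `num_recurrences`) and architectural details are illustrative assumptions, not the paper's exact design; a real model would also need causal masking and positional information.

```python
# Illustrative sketch of a recurrent-depth transformer (not the paper's API).
import torch
import torch.nn as nn


class RecurrentDepthLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # One weight-shared block reused at every recurrence step, so depth
        # is a runtime knob rather than a fixed architectural choice.
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor, num_recurrences: int = 4) -> torch.Tensor:
        # Causal masking omitted for brevity.
        h = self.embed(tokens)
        for _ in range(num_recurrences):  # more iterations = more compute
            h = self.block(h)
        return self.head(h)


# Same weights, different test-time compute budgets.
model = RecurrentDepthLM(vocab_size=1000)
x = torch.randint(0, 1000, (1, 16))
logits_cheap = model(x, num_recurrences=2)   # shallow, fast
logits_deep = model(x, num_recurrences=16)   # deeper unrolling for hard inputs
```

Because the loop count is a forward-pass argument, the same trained parameters can be run with more iterations on harder inputs, which is one plausible mechanism behind the reasoning gains the summary describes.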