We prove precisely how deeper transformers (with appropriate rounding) become more expressive, and show that empirical behaviour tracks our theory.