We propose spectral conditioning of attention layers to improve Jacobian conditioning, leading to more stable and efficient optimization with negligible computational overhead and consistent gains across diverse transformer architectures.
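The abstract does not specify how the conditioning is implemented; a common realization is to bound the spectral norm (largest singular value) of the attention projection matrices. The sketch below is a hypothetical illustration under that assumption, estimating the top singular value by power iteration and rescaling the weight matrix when it exceeds a target; the function names and the `target` parameter are illustrative, not from the paper.

```python
import numpy as np

def spectral_norm(W, n_iter=50):
    """Estimate the largest singular value of W via power iteration."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(W.shape[1])
    for _ in range(n_iter):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    # Rayleigh-quotient estimate of the top singular value
    return float(u @ W @ v)

def spectrally_condition(W, target=1.0):
    """Rescale W so its estimated spectral norm does not exceed `target`.

    Applied to attention projection weights (e.g. the query/key maps),
    this caps how much any direction can be amplified, which is one way
    to keep the layer's Jacobian well conditioned.
    """
    sigma = spectral_norm(W)
    return W if sigma <= target else W * (target / sigma)
```

In a training loop, such a rescaling would typically be applied to each attention projection matrix after every optimizer step (or as a reparametrization), so the bound holds throughout training.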