PhD student, Department of Computer Science, University of Toronto
2 papers at NeurIPS 2025
We prove a purely statistical separation between Transformers and other architectures, such as feedforward and recurrent networks, showing that Transformers are more sample-efficient at learning sparse sequence models.
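A minimal sketch of one plausible instantiation of a sparse sequence model, under my own assumption (not necessarily the paper's setup) that the label of a length-T sequence depends on only q hidden positions; the names T, q, and S below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sparse sequence model: the label of each length-T binary
# sequence is the parity of the bits at a small unknown subset S of
# q positions. T, q, and S are illustrative choices, not the paper's notation.
T, q, n = 64, 3, 1000                      # sequence length, sparsity, samples
S = rng.choice(T, size=q, replace=False)   # hidden relevant positions

X = rng.integers(0, 2, size=(n, T))        # n random binary sequences
y = X[:, S].sum(axis=1) % 2                # label depends only on positions in S

# Intuition for the separation: attention can learn to focus on the q
# positions in S directly, while an architecture that mixes all T positions
# must first identify S from data, costing extra samples.
print("hidden support S:", sorted(S))
print("first 5 labels:", y[:5])
```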