Associate Professor, University of Toronto
5 papers at NeurIPS 2025
We prove a purely statistical separation between Transformers and other architectures such as feedforward and recurrent networks, showing that Transformers are more sample-efficient at learning sparse sequence models.
We prove the asymptotic normality of PCA on the Grassmannian, and derive a tight non-asymptotic bound on its excess risk using self-concordance.
This paper introduces distributional training data attribution, a data attribution framework that accounts for the stochasticity of deep learning training and yields a mathematical justification for why influence functions work in this setting.