Research Staff Member, International Business Machines
2 papers at NeurIPS 2025
We demonstrate scenarios in which transformer models with sparse attention learn and generalize faster, and we theoretically characterize the conditions under which this occurs.
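A minimal sketch of what a sparse attention pattern can look like, assuming a simple local-window mask for illustration (the window size and masking scheme here are hypothetical, not the paper's construction):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_window_attention(Q, K, V, window=2):
    # Each query attends only to keys within `window` positions of it:
    # one simple instance of a sparse attention pattern.
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    idx = np.arange(T)
    scores[np.abs(idx[:, None] - idx[None, :]) > window] = -np.inf
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 4))  # toy queries, keys, values
print(local_window_attention(Q, K, V).shape)  # (8, 4)
```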
Nonlinear dynamical systems whose future behavior is not overly sensitive to small perturbations can be efficiently parallelized, whereas unpredictable dynamical systems cannot.
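A toy illustration of this dichotomy, assuming a Jacobi-style fixed-point sweep as the parallelization scheme; the `contractive` and `chaotic` maps are hypothetical examples, not taken from the paper. Each sweep updates every time step from the previous guess simultaneously, so one sweep is parallel across time; fewer sweeps than steps means a genuine speedup:

```python
import numpy as np

def parallel_sweeps(f, x0, T, tol=1e-10):
    # Iterate x[t] <- f(x[t-1]) for all t at once until the trajectory
    # stops changing. Worst case is T sweeps, since information can
    # propagate only one step per sweep.
    xs = np.full(T + 1, x0, dtype=float)
    for k in range(1, T + 1):
        new = xs.copy()
        new[1:] = f(xs[:-1])  # all time steps updated in parallel
        if np.max(np.abs(new - xs)) < tol:
            return k
        xs = new
    return T

contractive = lambda x: 0.5 * np.cos(x)    # small perturbations shrink
chaotic     = lambda x: 3.9 * x * (1 - x)  # logistic map: chaotic

T = 200
print(parallel_sweeps(contractive, 0.3, T))  # ~30 sweeps << T: parallelizes
print(parallel_sweeps(chaotic, 0.3, T))      # ~T sweeps: no speedup
```

The contractive map damps errors by a constant factor each sweep, so the whole trajectory converges in far fewer than T sweeps; the chaotic map amplifies errors, so convergence takes roughly T sweeps and the parallel scheme degenerates to sequential cost.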