Research Staff Member, International Business Machines
1 paper at NeurIPS 2025
We demonstrate scenarios in which sparse-attention-based transformer models learn and generalize faster, and we theoretically characterize the conditions under which this occurs.
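The profile does not specify which sparsity pattern the paper studies, so the following is a minimal illustrative sketch, not the paper's method: one common sparse-attention scheme, a banded/local window mask, contrasted with dense attention. The window size `w` and the helper names (`attention`, `local_window_mask`) are hypothetical choices for illustration, assuming NumPy.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; positions where mask is False
    # are set to -inf and so receive zero attention weight.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores, axis=-1) @ v

def local_window_mask(n, w):
    # Banded mask: token i may attend only to tokens j with |i - j| <= w.
    # (Hypothetical pattern for illustration; the paper's pattern may differ.)
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

rng = np.random.default_rng(0)
n, d = 8, 16
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

dense_out = attention(q, k, v)                            # full attention
sparse_out = attention(q, k, v, local_window_mask(n, 2))  # windowed attention
```

The intuition behind such masks is that restricting each token's receptive field shrinks the hypothesis space the model must search, which is one way sparse attention could plausibly speed up learning; the paper's theoretical characterization of when this helps is not reproduced here.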