We demonstrate scenarios in which transformer models with sparse attention learn and generalize faster than their dense-attention counterparts, and we theoretically characterize the conditions under which this occurs.
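The abstract does not specify which sparsity pattern is studied, so as a point of reference, the following is a minimal sketch of one common form of sparse attention: a local windowed mask, in which each token attends only to nearby tokens. The function name `local_window_attention` and all parameters are illustrative, not the paper's method.

```python
import torch
import torch.nn.functional as F

def local_window_attention(q, k, v, window: int):
    """Single-head attention restricted to a local window.

    q, k, v: (seq_len, d) tensors. Each query attends only to keys
    within `window` positions of itself; all other scores are masked.
    """
    seq_len, d = q.shape
    scores = q @ k.T / d**0.5                            # (seq_len, seq_len) raw scores
    idx = torch.arange(seq_len)
    mask = (idx[None, :] - idx[:, None]).abs() > window  # True where attention is disallowed
    scores = scores.masked_fill(mask, float("-inf"))     # masked entries get zero weight
    return F.softmax(scores, dim=-1) @ v                 # (seq_len, d) attended values

# Usage: 16 tokens, 8-dim embeddings, attending 2 positions to each side
q = k = v = torch.randn(16, 8)
out = local_window_attention(q, k, v, window=2)
print(out.shape)  # torch.Size([16, 8])
```

Masking restricts each query to a small neighborhood, which is the kind of inductive bias that could plausibly speed up learning on tasks with local structure, the sort of condition the paper's theory would characterize.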