Principal Researcher, International Business Machines
2 papers at NeurIPS 2025
Changing the DenseAM kernel from the standard Gaussian kernel to the KDE-optimal Epanechnikov kernel yields (1) exponential storage capacity without the exponential function and (2) meaningful, emergent memories.
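The kernel swap can be illustrated with a minimal sketch. The forms below are assumptions for illustration: the Gaussian kernel decays smoothly but never reaches zero, while the Epanechnikov kernel has compact support, so memories beyond a query's radius contribute exactly nothing to a DenseAM-style energy (the paper's exact energy may differ).

```python
import math

def gaussian_kernel(u, beta=1.0):
    """Standard Gaussian separation function exp(-beta * u^2)."""
    return math.exp(-beta * u * u)

def epanechnikov_kernel(u):
    """KDE-optimal Epanechnikov kernel: 3/4 (1 - u^2) on |u| <= 1, else 0."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def energy(query, memories, kernel):
    """Illustrative DenseAM-style energy: negative sum of kernel
    similarities between a scalar query and stored memories."""
    return -sum(kernel(abs(query - m)) for m in memories)
```

For example, `energy(0.0, [5.0, 10.0], epanechnikov_kernel)` is exactly zero because both memories lie outside the kernel's support, whereas the Gaussian energy is nonzero everywhere.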
We demonstrate scenarios where sparse-attention-based transformer models learn and generalize faster, and theoretically characterize the conditions under which this occurs.
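One common sparse-attention variant is top-k attention, sketched below as an assumption for illustration (the paper's exact sparsity mechanism is not specified here): only the k highest-scoring keys receive attention weight, and all other positions get exactly zero.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sparse_attention(query, keys, values, k=2):
    """Top-k sparse attention for a single query vector: attend only
    over the k keys with the highest dot-product scores."""
    scores = [sum(q * ki for q, ki in zip(query, key)) for key in keys]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    dim = len(values[0])
    out = [0.0] * dim
    for w, i in zip(weights, top):
        for d in range(dim):
            out[d] += w * values[i][d]
    return out
```

With k=1 the output collapses to the single best-matching value, e.g. `sparse_attention([1.0, 0.0], [[10.0, 0.0], [0.0, 10.0]], [[1.0, 0.0], [0.0, 1.0]], k=1)` returns `[1.0, 0.0]`.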