Poster Session 2 · Wednesday, December 3, 2025 4:30 PM → 7:30 PM
#3506 Spotlight

MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention

NeurIPS OpenReview

Abstract

Transformers have achieved state-of-the-art performance across various tasks, but the attention mechanism makes their cost quadratic in sequence length. In this work, we propose MonarchAttention, a novel approach to sub-quadratic attention approximation via Monarch matrices, an expressive class of structured matrices. Based on the variational form of softmax, we describe an optimization-based algorithm, efficient in both computational complexity and memory/IO complexity, that computes an approximate projection of softmax attention onto the class of Monarch matrices.
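For reference, the variational form of softmax is commonly stated as an entropy-regularized maximization over the probability simplex. The notation below ($z$ for a row of attention logits, $a$ for the resulting attention weights, $\Delta^{N-1}$ for the simplex, $H$ for Shannon entropy) is assumed here for illustration and may differ from the paper's.

```latex
\[
\operatorname{softmax}(z)
  \;=\; \operatorname*{arg\,max}_{a \,\in\, \Delta^{N-1}}
        \; \langle a, z \rangle + H(a),
\qquad
H(a) \;=\; -\sum_{i=1}^{N} a_i \log a_i,
\qquad
\Delta^{N-1} \;=\; \Bigl\{ a \in \mathbb{R}^{N} : a_i \ge 0,\ \sum_{i=1}^{N} a_i = 1 \Bigr\}.
\]
```

Read this way, each row of softmax attention is the solution of an optimization problem, so one plausible reading of "approximate projection" is solving the same kind of problem with the attention matrix restricted to a structured family.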
Unlike previous approaches, MonarchAttention is both
  1. transferable, yielding minimal performance loss with no additional training, even when replacing every attention layer of the transformer, and
  2. hardware-efficient, utilizing the highest-throughput tensor core units on modern GPUs.
With optimized kernels, MonarchAttention achieves substantial wall-time speed-ups over FlashAttention-2 across shorter, medium-length, and longer sequences. We demonstrate the quality of MonarchAttention on diverse tasks and architectures in vision and language problems, showing that it flexibly and accurately approximates softmax attention in a variety of contexts.
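To make the Monarch structure concrete, the following is a minimal NumPy sketch of a square Monarch matrix under one common parameterization, M = P L P R, where L and R are block-diagonal with √N blocks of size √N × √N and P is the reshape-transpose permutation. The function names, array shapes, and this exact factor ordering are assumptions for illustration; this is not the paper's kernel or its attention algorithm. It only shows why a Monarch matrix-vector product reduces to two batched small matrix multiplies, costing about 2N√N multiply-adds instead of N² for a dense matrix.

```python
import numpy as np


def monarch_matvec(L, R, x):
    """Apply M = P @ blockdiag(L) @ P @ blockdiag(R) to x without forming M.

    L, R: arrays of shape (m, m, m), i.e. m blocks of size m x m each (assumed layout).
    x:    vector of length N = m * m.
    """
    m = R.shape[0]
    X = x.reshape(m, m)                  # row k is the input to block k of R
    Y = np.einsum("kji,ki->kj", R, X)    # batched matvec with the blocks of R
    Y = Y.T                              # permutation P (reshape-transpose)
    Z = np.einsum("jlk,jk->jl", L, Y)    # batched matvec with the blocks of L
    return Z.T.reshape(-1)               # apply P again and flatten


def monarch_dense(L, R):
    """Materialize the same N x N matrix densely (for checking only)."""
    m = R.shape[0]
    n = m * m

    def block_diag(blocks):
        out = np.zeros((n, n))
        for k in range(m):
            out[k * m:(k + 1) * m, k * m:(k + 1) * m] = blocks[k]
        return out

    perm = np.arange(n).reshape(m, m).T.reshape(-1)
    P = np.eye(n)[perm]                  # (P @ v) equals v[perm]
    return P @ block_diag(L) @ P @ block_diag(R)


rng = np.random.default_rng(0)
m = 4                                    # N = m * m = 16
L = rng.standard_normal((m, m, m))
R = rng.standard_normal((m, m, m))
x = rng.standard_normal(m * m)

fast = monarch_matvec(L, R, x)           # structured product
slow = monarch_dense(L, R) @ x           # dense reference
print(np.allclose(fast, slow))
```

The two `einsum` calls are batched matrix multiplies over small dense blocks, which is the kind of operation that maps well onto GPU tensor cores; that is the general sense in which block-structured factors are considered hardware-friendly.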