MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention

Can Yaras, Alec S Xu, Pierre Abillama, Changwoo Lee, Laura Balzano

Keywords: structured matrices, efficient attention, monarch

Abstract

Transformers have achieved state-of-the-art performance across various tasks, but suffer from a notable quadratic complexity in sequence length due to the attention mechanism. In this work, we propose MonarchAttention, a novel approach to sub-quadratic attention approximation via Monarch matrices, an expressive class of structured matrices. Based on the variational form of softmax, we describe an efficient optimization-based algorithm to compute an approximate projection of softmax attention onto the class of Monarch matrices with $\Theta(N\sqrt{N} d)$ computational complexity and $\Theta(Nd)$ memory/IO complexity.
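To make the sub-quadratic cost concrete, here is a minimal sketch of a matrix-vector product with a Monarch-structured matrix. The abstract does not define the parameterization, so the sketch assumes one common form: two block-diagonal factors, each with $\sqrt{N}$ blocks of size $\sqrt{N} \times \sqrt{N}$, interleaved with a "reshape and transpose" permutation. The function names (`monarch_matvec`, `block_diag_matvec`, `transpose_perm`, `monarch_dense`) are hypothetical. This is not the MonarchAttention algorithm itself; it only illustrates why applying such a matrix needs on the order of $N\sqrt{N}$ multiplications per vector rather than $N^2$.

```python
import random

def block_diag_matvec(blocks, x):
    """Multiply a block-diagonal matrix, given as b square b-by-b blocks, by x."""
    b = len(blocks)
    out = []
    for k, block in enumerate(blocks):
        seg = x[k * b:(k + 1) * b]          # slice of x that block k acts on
        out.extend(sum(w * v for w, v in zip(row, seg)) for row in block)
    return out

def transpose_perm(x, b):
    """View x as a row-major b-by-b grid and return its transpose, flattened."""
    return [x[j * b + i] for i in range(b) for j in range(b)]

def monarch_matvec(L_blocks, R_blocks, x):
    """Assumed form: y = P (L (P (R x))), with P the transpose permutation."""
    b = len(R_blocks)
    y = block_diag_matvec(R_blocks, x)      # b blocks * b*b multiplies each
    y = transpose_perm(y, b)                # pure data movement
    y = block_diag_matvec(L_blocks, y)      # another b**3 multiplies
    return transpose_perm(y, b)

def monarch_dense(L_blocks, R_blocks):
    """Materialize the dense N-by-N matrix column by column (for checking only)."""
    n = len(R_blocks) ** 2
    cols = [monarch_matvec(L_blocks, R_blocks,
                           [1.0 if i == j else 0.0 for i in range(n)])
            for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]

random.seed(0)
b = 4                                       # block size; N = b * b
N = b * b

def rand_blocks():
    return [[[random.gauss(0, 1) for _ in range(b)] for _ in range(b)]
            for _ in range(b)]

L_blocks, R_blocks = rand_blocks(), rand_blocks()
x = [random.gauss(0, 1) for _ in range(N)]

fast = monarch_matvec(L_blocks, R_blocks, x)
dense = monarch_dense(L_blocks, R_blocks)
slow = [sum(dense[i][j] * x[j] for j in range(N)) for i in range(N)]

max_err = max(abs(f - s) for f, s in zip(fast, slow))
structured_mults = 2 * b ** 3               # 2 * N * sqrt(N)
dense_mults = N * N                         # N ** 2
```

Under this assumed parameterization, the structured product agrees with the dense one up to floating-point rounding. Applying the same factors to each of $d$ columns multiplies the count by $d$, which matches the $N\sqrt{N} d$ scaling quoted above.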

Unlike previous approaches, MonarchAttention is both (1) transferable, yielding minimal performance loss with no additional training, even when replacing every attention layer of the transformer, and (2) hardware-efficient, utilizing the highest-throughput tensor core units on modern GPUs.

With optimized kernels, MonarchAttention achieves substantial speed-ups in wall-time over FlashAttention-2: $1.4\times$ for shorter sequences ($N = 256$), $4.5\times$ for medium-length sequences ($N = 4\text{K}$), and $8.2\times$ for longer sequences ($N = 16\text{K}$). We demonstrate the quality of MonarchAttention on diverse tasks and architectures in vision and language problems, showing that it flexibly and accurately approximates softmax attention in a variety of contexts.