Poster Session 1 · Wednesday, December 3, 2025 11:00 AM → 2:00 PM
#3504

Linear Attention for Efficient Bidirectional Sequence Modeling

NeurIPS Project Page Slides OpenReview

Abstract

Linear Transformers and State Space Models have emerged as efficient alternatives to softmax Transformers for causal sequence modeling, enabling parallel training via matrix multiplication and efficient RNN-style inference. However, despite their success in causal tasks, no unified framework exists for applying Linear Transformers to bidirectional sequence modeling.
We introduce LION, the first framework to systematically extend Linear Transformers to the bidirectional setting. LION generalizes three core representations commonly used in the causal case—full Linear Attention, bidirectional RNN, and chunkwise parallel form—to the bidirectional setting. These forms are theoretically equivalent and enable models to exploit the strengths of each during training and inference.
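The equivalence between the full-attention form and the bidirectional-RNN form can be illustrated with a minimal NumPy sketch for the simplest case (linear attention with no decay and no normalization). This is an illustrative toy, not the paper's implementation: the output at each position combines a forward and a backward cumulative state of key-value outer products, subtracting the diagonal term that both scans count.

```python
import numpy as np

def linear_attn_full(Q, K, V):
    # Parallel "attention" form: O = (Q K^T) V with a full (non-causal) mask.
    return (Q @ K.T) @ V

def linear_attn_birnn(Q, K, V):
    # Bidirectional-RNN form: a forward scan and a backward scan over
    # cumulative states S_t = sum of outer products k_s v_s^T.
    L, d = K.shape
    dv = V.shape[1]
    fwd = np.zeros((L, dv))
    O = np.zeros((L, dv))
    S = np.zeros((d, dv))
    for t in range(L):                     # forward scan: s <= t
        S += np.outer(K[t], V[t])
        fwd[t] = Q[t] @ S
    S = np.zeros((d, dv))
    for t in reversed(range(L)):           # backward scan: s >= t
        S += np.outer(K[t], V[t])
        # position t appears in both scans; subtract the duplicate term
        O[t] = fwd[t] + Q[t] @ S - (Q[t] @ K[t]) * V[t]
    return O

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
assert np.allclose(linear_attn_full(Q, K, V), linear_attn_birnn(Q, K, V))
```

The RNN form touches each position once per direction with a fixed-size state, which is what makes memory-efficient inference possible; the attention form is a single pair of matrix multiplications suited to parallel training.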
We prove that a broad class of Linear Transformers can be extended using LION and validate our framework via three core examples based on the choice of decay type: LION-LIT, the bidirectional extension of [25]; LION-D, based on [44]; and LION-S, a variant using selective decay [34, 13].
Across standard bidirectional tasks, LION enables models to match or exceed the performance of softmax Transformers, while offering significantly faster training and more efficient inference than existing State Space Models.