5 papers across 3 sessions
We prove a purely statistical separation between Transformers and other architectures, such as feedforward and recurrent networks: Transformers are more sample-efficient at learning sparse sequence models.
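A minimal sketch of what a sparse sequence model can look like, assuming a hypothetical q-sparse formulation in which each output depends on only q of the T input positions; the paper's exact model class and separation argument are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical q-sparse sequence model: the output depends on only
# q of the T input positions (the paper's exact class may differ).
T, q, vocab = 16, 2, 8
support = rng.choice(T, size=q, replace=False)   # hidden sparse support
table = rng.integers(vocab, size=(vocab,) * q)   # arbitrary q-ary rule

def label(x):
    """Output depends only on the q positions in `support`."""
    return table[tuple(x[support])]

# A learner must identify the support from samples; intuitively,
# attention can select the q relevant positions directly, while a
# fixed feedforward readout must account for all T positions.
X = rng.integers(vocab, size=(5, T))
print(support, [label(x) for x in X])
```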
We propose LION, a framework for extending Linear Transformers to the bidirectional setting by providing three theoretically equivalent representations: full attention, bidirectional RNN, and chunkwise parallel form.
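A minimal numerical sketch of the attention/RNN equivalence that such bidirectional linear attention rests on, assuming unnormalized linear attention with an identity feature map (LION's exact parameterization, gating, and normalization may differ): non-causal linear attention equals a forward plus a backward causal scan, with the doubly counted diagonal term removed.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 10, 4
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# Full-attention form: O = (Q K^T) V, no causal mask, no softmax.
O_attn = (Q @ K.T) @ V

# Bidirectional-RNN form: forward state S_f(t) = sum_{s<=t} k_s v_s^T,
# backward state S_b(t) = sum_{s>=t} k_s v_s^T; the diagonal term
# k_t v_t^T appears in both scans, so subtract it once.
states_f, states_b = [], []
S_f, S_b = np.zeros((d, d)), np.zeros((d, d))
for t in range(T):
    S_f += np.outer(K[t], V[t])
    states_f.append(S_f.copy())
for t in reversed(range(T)):
    S_b += np.outer(K[t], V[t])
    states_b.append(S_b.copy())
states_b.reverse()

O_rnn = np.zeros((T, d))
for t in range(T):
    O_rnn[t] = Q[t] @ (states_f[t] + states_b[t] - np.outer(K[t], V[t]))

assert np.allclose(O_attn, O_rnn)  # the two representations agree
```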
RNNs used in computational neuroscience lie on manifolds whose geometry provides insights into their computations.
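A minimal sketch of the kind of analysis this line of work builds on, assuming PCA as the (standard, but here illustrative) tool: collect hidden-state trajectories from an RNN and check how much variance a low-dimensional subspace captures, a first proxy for the dimensionality of the underlying manifold.

```python
import numpy as np

rng = np.random.default_rng(2)
d_h, d_in, T, trials = 64, 2, 200, 20

# Random vanilla RNN (tanh); recurrent weights scaled for stability.
W = rng.standard_normal((d_h, d_h)) / np.sqrt(d_h)
U = rng.standard_normal((d_h, d_in))

H = []
for _ in range(trials):
    phase = rng.uniform(0, 2 * np.pi)
    h = np.zeros(d_h)
    for t in range(T):
        x = np.array([np.sin(0.1 * t + phase), np.cos(0.1 * t + phase)])
        h = np.tanh(W @ h + U @ x)
        H.append(h)
H = np.array(H)

# PCA via SVD of centered states: the spectrum shows how few linear
# dimensions capture the trajectories (a proxy for manifold dimension).
H -= H.mean(axis=0)
s = np.linalg.svd(H, compute_uv=False)
var = s**2 / np.sum(s**2)
print("variance in top 3 PCs:", var[:3].sum())
```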
A sparse attention mechanism balances efficiency, long-range random-access flexibility, and length-generalization ability.
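A minimal sketch of one common way to combine these properties, assuming a sliding-window plus global-token mask (the paper's specific mechanism is not reproduced): local windows give efficiency and length generalization, while a few global positions retain long-range random access.

```python
import numpy as np

def sparse_attention(Q, K, V, window=4, n_global=2):
    """Softmax attention under a sliding-window + global-token mask.

    Each query attends to positions within `window` of itself (local,
    O(T*window) nonzeros) and to the first `n_global` positions, which
    serve as long-range access points. Illustrative only.
    """
    T, d = Q.shape
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    mask = (np.abs(i - j) <= window) | (j < n_global)
    scores = (Q @ K.T) / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(3)
T, d = 12, 4
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
print(sparse_attention(Q, K, V).shape)  # (12, 4)
```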
We introduce TiledFlashLinearAttention, a faster kernel algorithm for linear RNNs and mLSTMs via improved sequence parallelism.
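A minimal sketch of the chunkwise-parallel math such kernels build on, assuming plain unnormalized causal linear attention (the mLSTM adds gating, and the actual tiling and sequence parallelism live at the kernel level): a recurrent inter-chunk state plus parallel intra-chunk attention.

```python
import numpy as np

def chunkwise_linear_attention(Q, K, V, chunk=4):
    """Causal linear attention computed chunk by chunk.

    Inter-chunk: a running state S = sum of k_s v_s^T over past chunks
    (sequential). Intra-chunk: causally masked attention within the
    chunk (parallel, matmul-friendly) -- the part tiled kernels speed up.
    """
    T, d = Q.shape
    O = np.zeros((T, d))
    S = np.zeros((d, d))
    causal = np.tril(np.ones((chunk, chunk)))
    for start in range(0, T, chunk):
        q, k, v = (M[start:start + chunk] for M in (Q, K, V))
        n = len(q)
        O[start:start + n] = q @ S + ((q @ k.T) * causal[:n, :n]) @ v
        S += k.T @ v
    return O

rng = np.random.default_rng(4)
T, d = 16, 4
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# Reference: fully sequential causal linear attention.
ref = np.array([Q[t] @ sum(np.outer(K[s], V[s]) for s in range(t + 1))
                for t in range(T)])
assert np.allclose(chunkwise_linear_attention(Q, K, V), ref)
```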