UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression

Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Tianqing Fang, Hongming Zhang, Haitao Mi, Dong Yu, Zhicheng Dou

Renmin University of China · Tencent AI Lab · City University of Hong Kong

Long context compression · Sparse attention

Abstract

Large language models are increasingly capable of handling long-context inputs, but the memory overhead of the KV cache remains a major bottleneck for general-purpose deployment. While many compression strategies have been explored, sequence-level compression is particularly challenging because it tends to lose important details.

We present UniGist, a gist token-based long context compression framework that removes the need for chunk-wise training, enabling the model to learn how to compress and utilize long-range context during training.

To fully exploit the sparsity of the resulting attention pattern, we introduce a gist shift trick that transforms the attention layout into a right-aligned block structure, and we develop a block-table-free sparse attention kernel based on it. UniGist further supports one-pass training and flexible chunk sizes during inference, allowing efficient and adaptive context processing.
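To make the sparsity concrete, the following is a minimal sketch of a generic gist-token attention mask. It is not taken from the UniGist implementation: the layout (each chunk of raw tokens followed by its own gist tokens), the visibility rule (full causal attention inside the current chunk, gist tokens only for earlier chunks), and the names `gist_attention_mask` and `density` are all assumptions made for illustration. The gist shift trick and the sparse kernel are not modeled here.

```python
def gist_attention_mask(num_chunks, chunk_size, gists_per_chunk):
    """Return mask[q][k], True if query position q may attend to key position k.

    Assumed layout (hypothetical, for illustration only): every chunk of
    `chunk_size` raw tokens is immediately followed by `gists_per_chunk`
    gist tokens that summarize it.
    """
    # Record (chunk index, is_gist) for every position in the sequence.
    kinds = []
    for c in range(num_chunks):
        kinds += [(c, False)] * chunk_size + [(c, True)] * gists_per_chunk

    n = len(kinds)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        q_chunk, _ = kinds[q]
        for k in range(q + 1):  # causal: only keys at or before the query
            k_chunk, k_is_gist = kinds[k]
            # Same chunk: everything is visible.
            # Earlier chunks: only their gist tokens are visible.
            mask[q][k] = (k_chunk == q_chunk) or k_is_gist
    return mask


def density(mask):
    """Fraction of the full causal triangle that the mask keeps."""
    n = len(mask)
    allowed = sum(sum(row) for row in mask)
    return allowed / (n * (n + 1) // 2)
```

Under these assumptions, calling `density(gist_attention_mask(num_chunks, chunk_size, gists_per_chunk))` with more chunks gives a smaller kept fraction, because every raw token outside the current chunk is masked out. This is the kind of sparsity a dedicated kernel can exploit.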

Experiments across multiple long-context tasks show that UniGist significantly improves compression quality, with especially strong performance in detail recall and long-range dependency modeling.