Poster Session 4 · Thursday, December 4, 2025, 4:30 PM – 7:30 PM
#3901
L²M: Mutual Information Scaling Law for Long-Context Language Modeling
NSF AI Institute for Artificial Intelligence and Fundamental Interactions · MIT · Polytechnic University of Catalonia · Harvard · University of California, Los Angeles
language models · information theory · mutual information · predictive information · large language models · long context length modeling · long-range dependence · scaling laws · long-context language modeling · sequence modeling · autoregressive models · mutual information estimation · transformers · recurrent neural networks · state-space models
Abstract
We present a universal theoretical framework for understanding long-context language modeling based on a bipartite mutual information scaling law that we rigorously verify in natural language.
We demonstrate that bipartite mutual information captures multi-token interactions that are distinct from, and scale independently of, conventional two-point mutual information, and we show that it provides a more complete characterization of the dependencies needed to accurately model long sequences. Leveraging this scaling law, we formulate the Long-context Language Modeling (L²M) condition, which lower bounds the necessary scaling of a model's history state (the latent variables responsible for storing past information) for effective long-context modeling.
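As a hedged sketch of the two quantities the abstract contrasts (notation ours, not necessarily the paper's): split a sequence of length 2L into adjacent halves X and Y; the two-point and bipartite mutual informations are then

```latex
% Notation assumed for illustration: x_1, ..., x_{2L} is a token sequence,
% with halves X = (x_1, ..., x_L) and Y = (x_{L+1}, ..., x_{2L}).
\begin{align}
  I_2(d) &= I(x_i \,;\, x_{i+d})
    && \text{two-point MI at token separation } d, \\
  I_B(L) &= I(X \,;\, Y) = H(X) + H(Y) - H(X, Y)
    && \text{bipartite MI between the halves.}
\end{align}
% The scaling law asserts power-law growth I_B(L) \propto L^{\beta}; the
% abstract's point is that this exponent is not fixed by how fast the
% two-point quantity I_2(d) decays with d.
```

On this reading, the L²M condition says, informally, that the size of a model's history state must grow at least as fast as I_B(L); this paraphrase is ours, based only on the abstract's wording.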
We validate the framework and its predictions on transformer and state-space models. Our work provides a principled foundation for understanding long-context modeling and for designing more efficient architectures with stronger long-context capabilities, with potential applications beyond natural language.
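For intuition only, here is a minimal, self-contained sketch (not the paper's estimator) of the plug-in two-point mutual information between characters at separation d; the function name `two_point_mi` and the toy corpus are our own illustrative choices.

```python
# Plug-in (count-based) estimate of the two-point mutual information
# I(x_i ; x_{i+d}) between characters d positions apart, averaged over
# all positions in a text. Illustrative only; not the paper's method.
from collections import Counter
from math import log2

def two_point_mi(text: str, d: int) -> float:
    """Empirical I(x_i; x_{i+d}) in bits, pooled over all positions i."""
    pairs = [(text[i], text[i + d]) for i in range(len(text) - d)]
    n = len(pairs)
    joint = Counter(pairs)                  # joint counts of (x_i, x_{i+d})
    left = Counter(a for a, _ in pairs)     # marginal counts of x_i
    right = Counter(b for _, b in pairs)    # marginal counts of x_{i+d}
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return mi

corpus = "the quick brown fox jumps over the lazy dog " * 200
for d in (1, 2, 4, 8, 16):
    print(f"d={d:2d}  I2 ≈ {two_point_mi(corpus, d):.4f} bits")
```

Note that the bipartite quantity cannot be estimated this naively: the joint alphabet of a length-L block grows exponentially with L, which is presumably why verifying the scaling law in natural language requires the more careful estimation the abstract alludes to.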