Poster Session 4 · Thursday, December 4, 2025, 4:30 PM – 7:30 PM
#3901
L²M: Mutual Information Scaling Law for Long-Context Language Modeling
NSF AI Institute for Artificial Intelligence and Fundamental Interactions · MIT · Polytechnic University of Catalonia · Harvard · University of California, Los Angeles
language models · information theory · mutual information · predictive information · large language models · long context length modeling · long-range dependence · scaling laws · long-context language modeling · sequence modeling · autoregressive models · mutual information estimation · transformers · recurrent neural networks · state-space models
Abstract
We present a universal theoretical framework for understanding long-context language modeling based on a bipartite mutual information scaling law that we rigorously verify in natural language.
We demonstrate that bipartite mutual information captures multi-token interactions that are distinct from, and scale independently of, conventional two-point mutual information, and we show that it provides a more complete characterization of the dependencies needed to accurately model long sequences. Leveraging this scaling law, we formulate the Long-context Language Modeling (L²M) condition, which lower bounds the necessary scaling of a model's history state (the latent variables responsible for storing past information) for effective long-context modeling.
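As a hedged sketch of the two quantities the abstract contrasts (notation ours, not necessarily the paper's): split a sequence of length 2L into adjacent halves X and Y; the two-point and bipartite mutual informations are then

```latex
% Notation assumed for illustration: x_1, ..., x_{2L} is a token sequence,
% with halves X = (x_1, ..., x_L) and Y = (x_{L+1}, ..., x_{2L}).
\begin{align}
  I_2(d) &= I(x_i \,;\, x_{i+d})
    && \text{two-point MI at token separation } d, \\
  I_B(L) &= I(X \,;\, Y) = H(X) + H(Y) - H(X, Y)
    && \text{bipartite MI between the halves.}
\end{align}
% The scaling law asserts power-law growth I_B(L) \propto L^{\beta}; the
% abstract's point is that this exponent is not fixed by how fast the
% two-point quantity I_2(d) decays with d.
```

On this reading, the L²M condition says, informally, that the size of a model's history state must grow at least as fast as I_B(L); this paraphrase is ours, based only on the abstract's wording.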
We validate the framework and its predictions on transformer and state-space models. Our work provides a principled foundation for understanding long-context modeling and for designing more efficient architectures with stronger long-context capabilities, with potential applications beyond natural language.
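For intuition only, here is a minimal, self-contained sketch (not the paper's estimator) of the plug-in two-point mutual information between characters at separation d; the function name `two_point_mi` and the toy corpus are our own illustrative choices.

```python
# Plug-in (count-based) estimate of the two-point mutual information
# I(x_i ; x_{i+d}) between characters d positions apart, averaged over
# all positions in a text. Illustrative only; not the paper's method.
from collections import Counter
from math import log2

def two_point_mi(text: str, d: int) -> float:
    """Empirical I(x_i; x_{i+d}) in bits, pooled over all positions i."""
    pairs = [(text[i], text[i + d]) for i in range(len(text) - d)]
    n = len(pairs)
    joint = Counter(pairs)                  # joint counts of (x_i, x_{i+d})
    left = Counter(a for a, _ in pairs)     # marginal counts of x_i
    right = Counter(b for _, b in pairs)    # marginal counts of x_{i+d}
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return mi

corpus = "the quick brown fox jumps over the lazy dog " * 200
for d in (1, 2, 4, 8, 16):
    print(f"d={d:2d}  I2 ≈ {two_point_mi(corpus, d):.4f} bits")
```

Note that the bipartite quantity cannot be estimated this naively: the joint alphabet of a length-L block grows exponentially with L, which is presumably why verifying the scaling law in natural language requires the more careful estimation the abstract alludes to.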