Poster Session 4 · Thursday, December 4, 2025 4:30 PM → 7:30 PM
#3306

Foundations of Top-k Decoding for Language Models

NeurIPS OpenReview

Abstract

Top-k decoding is a widely used method for sampling from LLMs: at each step, only the k largest next-token probabilities are kept, and the next token is sampled after re-normalizing them to sum to unity. Top-k and other truncation-based sampling methods are motivated by the intuition that the true next-token distribution is sparse, and that the noisy LLM probabilities therefore need to be truncated. However, to our knowledge, a precise theoretical motivation for top-k decoding has been missing.
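The truncate-and-renormalize procedure described above can be sketched as follows. This is a minimal illustration of standard top-k sampling, not code from the paper; the function name and interface are assumptions.

```python
import numpy as np

def top_k_decode(probs, k, rng=None):
    """Sample a token index using standard top-k decoding.

    probs: 1-D array of next-token probabilities (summing to 1).
    k: number of largest probabilities to keep.
    """
    rng = rng or np.random.default_rng()
    # Keep the indices of the k largest probabilities, zero out the rest.
    top = np.argpartition(probs, -k)[-k:]
    truncated = np.zeros_like(probs)
    truncated[top] = probs[top]
    # Re-normalize the kept mass to sum to unity, then sample.
    truncated /= truncated.sum()
    return rng.choice(len(probs), p=truncated)
```

Note that all probabilities outside the top k receive exactly zero mass, which is the hard truncation the framework below seeks to justify.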
In this work, we develop a theoretical framework that both explains and generalizes top-k decoding. We view decoding at a fixed token as the recovery of a sparse probability distribution. We introduce Bregman decoders obtained by minimizing a separable Bregman divergence (in both the primal and dual cases) with a sparsity-inducing ℓ0-regularization; in particular, these decoders are adaptive in the sense that the sparsity parameter k is chosen depending on the underlying token distribution.
Despite the combinatorial nature of the sparse Bregman objective, we show how to optimize it efficiently for a large class of divergences. We prove that:
  1. the optimal decoding strategies are greedy, and further that
  2. the objective is discretely convex in k, so that the optimal k can be identified in logarithmic time.
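Discrete convexity means the forward difference f(k+1) − f(k) is non-decreasing in k, so the minimizer can be located by binary search on the sign of that difference. A generic sketch of this logarithmic-time search (the abstract does not give the algorithm's details, so this is an assumed but standard realization):

```python
def argmin_discrete_convex(f, lo, hi):
    """Minimize a discretely convex f over the integers {lo, ..., hi}.

    For discretely convex f, the forward difference f(k+1) - f(k)
    is non-decreasing, so we binary-search for the first k at which
    it becomes non-negative. Uses O(log(hi - lo)) evaluations of f.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if f(mid + 1) - f(mid) >= 0:
            hi = mid       # difference already non-negative: minimizer <= mid
        else:
            lo = mid + 1   # objective still decreasing: minimizer > mid
    return lo
```

Here f would be the sparse Bregman objective evaluated at sparsity level k; the search never needs to enumerate all vocabulary sizes.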
We note that standard top-k decoding arises as a special case for the KL divergence, and we construct new decoding strategies with substantially different behaviors (e.g., non-linearly up-weighting larger probabilities after re-normalization).
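To make the "non-linear up-weighting" behavior concrete, here is a hypothetical variant (not a decoder from the paper): keep the k largest probabilities, but raise them to a power alpha > 1 before re-normalizing, which shifts mass toward the larger kept probabilities relative to plain top-k.

```python
import numpy as np

def power_top_k(probs, k, alpha=2.0):
    """Hypothetical illustration: top-k truncation followed by a
    non-linear re-weighting. With alpha > 1, larger kept probabilities
    are up-weighted relative to plain top-k re-normalization."""
    # Indices of the k largest probabilities, in ascending index order.
    top = np.sort(np.argpartition(probs, -k)[-k:])
    kept = probs[top] ** alpha
    return top, kept / kept.sum()
```

For probs = [0.5, 0.3, 0.1, 0.1] and k = 2, plain top-k re-normalizes to (0.625, 0.375), while alpha = 2 yields roughly (0.735, 0.265): the largest probability is up-weighted non-linearly.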