Poster Session 3 · Thursday, December 4, 2025 11:00 AM → 2:00 PM
#314

Q♯: Provably Optimal Distributional RL for LLM Post-Training

NeurIPS · Project Page · Poster · OpenReview

Abstract

Reinforcement learning (RL) post-training is crucial for LLM alignment and reasoning, but existing policy-based methods, such as PPO and DPO, can fall short of fixing shortcuts inherited from pre-training. In this work, we introduce Q♯, a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized Q function. We propose to learn the optimal Q function using distributional RL on an aggregated online dataset. Unlike prior value-based baselines that guide the model using unregularized Q-values, our method is theoretically principled and provably learns the optimal policy for the KL-regularized RL problem.
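For context, the closed form that value-based guidance relies on in KL-regularized RL is standard; the notation below (regularization weight β, states s, actions a) is introduced here for illustration and may not match the paper's exactly. The optimal policy reweights the reference policy by the exponentiated optimal regularized Q function:

```latex
% KL-regularized objective: maximize reward while penalizing divergence from the reference policy
\max_{\pi}\;\; \mathbb{E}_{\pi}\Big[\textstyle\sum_{t} r(s_t, a_t)\Big] \;-\; \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)

% Its optimal policy tilts the reference policy by the optimal regularized Q function:
\pi^{*}(a \mid s) \;=\; \pi_{\mathrm{ref}}(a \mid s)\,
  \exp\!\Big(\tfrac{1}{\beta}\big(Q^{*}(s,a) - V^{*}(s)\big)\Big),
\qquad
V^{*}(s) \;=\; \beta \log\, \mathbb{E}_{a \sim \pi_{\mathrm{ref}}(\cdot \mid s)}
  \Big[\exp\!\big(\tfrac{1}{\beta}\,Q^{*}(s,a)\big)\Big].
```

Normalizing by V* is what keeps the guided policy a proper distribution close to the reference policy; a larger β pulls π* toward π_ref.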
Empirically, Q♯ outperforms prior baselines on math reasoning benchmarks while maintaining a smaller KL divergence to the reference policy.
Theoretically, we establish a reduction from KL-regularized RL to no-regret online learning, providing the first bounds for deterministic MDPs under only realizability. Thanks to distributional RL, our bounds are also variance-dependent and converge faster when the reference policy has small variance.
In sum, our results highlight Q♯ as an effective approach for post-training LLMs, offering both improved performance and theoretical guarantees. The code can be found at https://github.com/jinpz/qsharp.
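As a purely illustrative sketch of how guidance by a regularized Q function could be applied at decoding time (the function and variable names below are hypothetical and not taken from the released code), one can add estimated Q values, scaled by 1/β, to the reference policy's per-token log-probabilities and renormalize:

```python
import numpy as np

def guided_next_token(ref_logprobs: np.ndarray, q_values: np.ndarray, beta: float) -> np.ndarray:
    """Combine reference-policy log-probs with Q estimates per the KL-regularized closed form.

    ref_logprobs: log pi_ref(a | s) over the vocabulary, shape (V,)
    q_values:     estimated regularized Q(s, a) for each candidate token, shape (V,)
    beta:         KL regularization weight
    Returns the guided distribution pi*(a | s) over the vocabulary.
    """
    logits = ref_logprobs + q_values / beta   # log pi_ref + Q / beta
    logits -= logits.max()                    # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()                # normalization plays the role of the V*(s) term

# Toy usage: a 4-token vocabulary with made-up reference probabilities and Q estimates
rng = np.random.default_rng(0)
ref = np.log(np.array([0.7, 0.2, 0.05, 0.05]))
q = np.array([0.0, 1.0, 0.5, -0.5])
pi_star = guided_next_token(ref, q, beta=1.0)
next_token = rng.choice(len(pi_star), p=pi_star)
```

Because the reference log-probabilities enter additively, tokens the reference policy already rules out stay unlikely, which is consistent with the small KL divergence reported above.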
Poster