Poster Session 2 · Wednesday, December 3, 2025 4:30 PM → 7:30 PM
#3700
DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization
Abstract
The recent success and openness of DeepSeek-R1 have brought widespread attention to Group Relative Policy Optimization (GRPO) as a reinforcement learning method for large reasoning models (LRMs). In this work, we analyze the GRPO objective under a binary reward setting and reveal an inherent limitation: a question-level difficulty bias arising from its group relative advantage function. We also identify a connection between GRPO and traditional discriminative methods in supervised learning.
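The difficulty bias described above can be made concrete with a small numeric sketch. Under a binary reward, if a fraction p of a question's sampled answers are correct, the group relative advantage is (r - p) / std(r); the sketch below (illustrative only, using population statistics and a hypothetical `eps` for numerical safety, not the paper's exact derivation) shows how the net update weight peaks at p = 0.5 and shrinks for very hard or very easy questions:

```python
import math

def grpo_binary_advantages(p, eps=1e-8):
    """Illustrative group-relative advantages under a binary reward,
    where p is the fraction of correct answers in the sampled group."""
    std = math.sqrt(p * (1 - p)) + eps          # group std of 0/1 rewards
    adv_pos = (1 - p) / std                     # advantage of a correct answer
    adv_neg = (0 - p) / std                     # advantage of an incorrect answer
    net_weight = p * adv_pos                    # per-question weight on positives
    return adv_pos, adv_neg, net_weight

# Medium-difficulty questions (p near 0.5) receive the largest net weight;
# very hard (p -> 0) or very easy (p -> 1) questions are down-weighted.
for p in (0.1, 0.5, 0.9):
    print(p, grpo_binary_advantages(p))
```

Here `net_weight` equals sqrt(p(1-p)), which is 0.3 at p = 0.1 or p = 0.9 but 0.5 at p = 0.5, so the gradient contribution depends on question difficulty rather than only on answer quality.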
Motivated by these insights, we introduce a new Discriminative Constrained Optimization (DisCO) framework for reinforcing LRMs, grounded in the principle of discriminative learning: increasing the scores of positive answers while decreasing those of negative ones.
DisCO differs from GRPO and its recent variants in three main ways:
- it replaces the group relative objective with a discriminative objective defined by a scoring function;
- it abandons clipping-based surrogates in favor of non-clipping RL surrogate objectives used as scoring functions;
- it employs a simple yet effective constrained optimization approach to enforce the KL divergence constraint.
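The three components above can be sketched together in a few lines. This is a minimal illustration, not the paper's implementation: the scoring function, the KL estimate, and the penalty coefficient `beta` and threshold `delta` are all assumptions chosen for clarity.

```python
import torch

def disco_style_objective(logp_new, logp_old, labels, kl_est, delta=0.01, beta=10.0):
    """Hypothetical sketch of a discriminative constrained objective.

    logp_new, logp_old: per-answer sequence log-probs under current/old policy
    labels: 1 for positive (correct) answers, 0 for negative ones
    kl_est: scalar estimate of the KL divergence to the old policy
    """
    # Non-clipping scoring function: here, the log-likelihood ratio itself
    scores = logp_new - logp_old.detach()
    pos, neg = scores[labels == 1], scores[labels == 0]
    # Discriminative objective: raise positive scores, lower negative ones
    objective = pos.mean() - neg.mean()
    # One simple way to enforce KL <= delta: a squared-hinge penalty
    penalty = beta * torch.clamp(kl_est - delta, min=0.0) ** 2
    return -objective + penalty  # loss to minimize
```

Note there is no clipping anywhere in the scoring function; the KL constraint, rather than a clipped ratio, is what keeps the policy close to its reference.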
As a result, DisCO offers notable advantages over GRPO and its variants:
- it completely eliminates difficulty bias by adopting discriminative objectives;
- it addresses the entropy instability in GRPO and its variants through the use of non-clipping scoring functions and a constrained optimization approach, yielding long and stable training dynamics;
- it allows the incorporation of advanced discriminative learning techniques to address data imbalance, where many questions yield far more negative than positive generated answers during training.
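As an example of the kind of imbalance-aware discriminative technique the last point refers to, a classical option is a pairwise (AUC-style) surrogate that compares every positive answer against every negative one, so the objective does not degrade when negatives dominate the sampled group. The function name, margin value, and squared-hinge choice below are illustrative assumptions, not the paper's specific method:

```python
import torch

def pairwise_auc_loss(scores, labels, margin=1.0):
    """Illustrative pairwise (AUC-style) surrogate: each positive score is
    compared against each negative score, making the loss insensitive to how
    many answers of each kind the sampler happened to produce."""
    pos = scores[labels == 1].unsqueeze(1)   # shape (P, 1)
    neg = scores[labels == 0].unsqueeze(0)   # shape (1, N)
    # Squared hinge on every positive-negative pair, broadcast to (P, N)
    return torch.clamp(margin - (pos - neg), min=0.0).pow(2).mean()
```

Because the loss averages over positive-negative pairs rather than over answers, a lone positive answer among many negatives still contributes to every pairwise term.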
Our experiments on enhancing the mathematical reasoning capabilities of SFT-finetuned models show that DisCO significantly outperforms GRPO and its improved variants such as DAPO, achieving average gains of 7% over GRPO and 6% over DAPO across six benchmark tasks for a 1.5B model.