Poster Session 2 · Wednesday, December 3, 2025 4:30 PM → 7:30 PM
#3700

DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization


Abstract

The recent success and openness of DeepSeek-R1 have brought widespread attention to Group Relative Policy Optimization (GRPO) as a reinforcement learning method for large reasoning models (LRMs). In this work, we analyze the GRPO objective under a binary reward setting and reveal an inherent limitation: a question-level difficulty bias arising from its group-relative advantage function. We also identify a connection between GRPO and traditional discriminative methods in supervised learning.
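The difficulty bias is easy to see numerically. Under a binary reward, GRPO's group-relative advantage normalizes each rollout's reward by the group mean and standard deviation, so the magnitude of the advantage, and hence the per-question gradient weight, depends on the question's empirical pass rate rather than being uniform across questions. A minimal sketch (the group size and normalization shown here are a standard GRPO formulation, not taken verbatim from this paper):

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantage: (r_i - mean) / std within one question's group of rollouts."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    if std == 0:
        # All rollouts identical (all correct or all wrong): no gradient signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# Binary rewards on a question where `correct` of G = 8 rollouts succeed.
# With pass rate p, a correct answer gets advantage (1-p)/sqrt(p(1-p)) and a
# wrong one gets -p/sqrt(p(1-p)), so the weighting varies with difficulty p.
for correct in (1, 4, 7):
    G = 8
    rewards = [1.0] * correct + [0.0] * (G - correct)
    adv = group_relative_advantages(rewards)
    print(f"pass rate {correct / G}: adv(correct) = {adv[0]:.3f}, adv(wrong) = {adv[-1]:.3f}")
```

Hard questions (low pass rate) give their rare correct answers very large advantages, while easy questions give correct answers small ones; this asymmetric, difficulty-dependent weighting is the bias the abstract refers to.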
Motivated by these insights, we introduce a new Discriminative Constrained Optimization (DisCO) framework for reinforcing LRMs, grounded in the principle of discriminative learning: increasing the scores of positive answers while decreasing those of negative ones.
DisCO differs from GRPO and its recent variants in three main ways:
  1. it replaces the group relative objective with a discriminative objective defined by a scoring function;
  2. it abandons clipping-based surrogates in favor of non-clipping RL surrogate objectives used as scoring functions;
  3. it employs a simple yet effective constrained optimization approach to enforce the KL divergence constraint.
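The three ingredients above can be combined in a toy objective. The sketch below is illustrative only: the specific scoring function (length-normalized log-likelihood), KL budget, and squared-hinge penalty are assumptions standing in for the paper's exact choices. It shows the discriminative part (push positive scores up, negative scores down, with no clipping anywhere) and a penalty that activates only when the KL constraint is violated:

```python
def disco_style_loss(pos_scores, neg_scores, kl_estimate,
                     kl_budget=0.05, penalty_coef=10.0):
    """Toy DisCO-style loss (illustrative; exact forms are assumptions).

    pos_scores / neg_scores: scoring-function values s(o|q) for generated
    answers judged correct / incorrect, e.g. length-normalized log-likelihood
    of the answer under the current policy.
    The discriminative term is mean(negative scores) - mean(positive scores),
    so minimizing it raises positive scores and lowers negative ones.
    The second term enforces KL(pi_theta || pi_old) <= kl_budget via a
    squared-hinge penalty that is zero while the constraint holds.
    """
    discr = sum(neg_scores) / len(neg_scores) - sum(pos_scores) / len(pos_scores)
    violation = max(0.0, kl_estimate - kl_budget)
    return discr + penalty_coef * violation ** 2

# Constraint inactive: pure discriminative loss.
print(disco_style_loss([-0.8, -0.9], [-1.4, -1.2], kl_estimate=0.02))
# KL over budget: the penalty term kicks in and pushes the policy back.
print(disco_style_loss([-0.8, -0.9], [-1.4, -1.2], kl_estimate=0.15))
```

Because no clipping is applied to the scores, gradients never vanish abruptly for off-policy samples, which is the mechanism behind the stable entropy dynamics claimed below.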
As a result, DisCO offers notable advantages over GRPO and its variants:
  1. it completely eliminates difficulty bias by adopting discriminative objectives;
  2. it addresses the entropy instability in GRPO and its variants through the use of non-clipping scoring functions and a constrained optimization approach, yielding long and stable training dynamics;
  3. it allows the incorporation of advanced discriminative learning techniques to address data imbalance, where a significant number of questions have more negative than positive generated answers during training.
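On the data-imbalance point: when most sampled answers to a question are wrong, an unweighted objective is dominated by the negatives. One standard discriminative remedy (shown here as an illustration, not necessarily DisCO's exact mechanism) is cost-sensitive weighting, giving each class half the total weight regardless of its count:

```python
def balanced_group_weights(labels):
    """Cost-sensitive weights for one question's sampled answers.

    labels: 1 for a correct (positive) answer, 0 for an incorrect one.
    Each class receives total weight 1/2, so a question with 1 positive and
    7 negatives still lets the lone positive carry half the signal.
    Illustrative of discriminative imbalance handling; assumes the group
    contains at least one positive and one negative answer.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return [0.5 / n_pos if y == 1 else 0.5 / n_neg for y in labels]

# One correct answer out of eight: it alone gets weight 0.5.
print(balanced_group_weights([1, 0, 0, 0, 0, 0, 0, 0]))
```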
Our experiments on enhancing the mathematical reasoning capabilities of SFT-finetuned models show that DisCO significantly outperforms GRPO and its improved variants such as DAPO, achieving average gains of 7% over GRPO and 6% over DAPO across six benchmark tasks for a 1.5B model.