logo
today local_bar
Poster Session 4 · Thursday, December 4, 2025 4:30 PM → 7:30 PM
#507

Beyond Verifiable Rewards: Scaling Reinforcement Learning in Language Models to Unverifiable Data

NeurIPS OpenReview

Abstract

We propose to scale RL to unverifiable data with a novel algorithm JEPO (Jensen's Evidence lower bound for Policy Optimization). While most prior effort on scaling RL for LLMs focuses on verifiable data where ground truth answers are typically short-form and can be matched easily, we investigate the case where such assumptions are less valid (e.g., when answers are long-form such as mathematical proofs).
To scale RL training to unverifiable data with contemporary training constraints, we propose JEPO. JEPO applies Jensen's evidence lower bound, a pragmatic simplification of the evidence lower bound which views chain-of-thought as a latent variable in the generative process.
We show that on verifiable datasets (math), JEPO is as effective as RL with verifiable reward; on semi-verifiable and unverifiable datasets (numina and numina-proof), JEPO improves on soft-match based evaluations compared to RL with verifiable reward which can only leverage a subset of the data source as well as test set likelihood evaluations.