Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective

Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie, Feng Yao, Yuexin Bian, Nilabjo Dey, Yonghao Zhuang, Yuheng Zha, Yi Gu, Kun Zhou, Yuqi Wang, Yuan Li, Richard Fan, Jianshu She, Chengqian Gao, Abulhair Saparov, Taylor W. Killian, Haonan Li, Mikhail Yurochkin, Eric P. Xing, Zhengzhong Liu, Zhiting Hu

UC San Diego · MBZUAI · Carnegie Mellon University · Purdue University

large language model, reasoning, reinforcement learning

NeurIPS ⋅ Project Page ⋅ OpenReview

Abstract

Reinforcement learning (RL) has shown promise in enhancing large language model (LLM) reasoning, yet progress towards broader capabilities is limited by the availability of high-quality, multi-domain datasets.

This work introduces a 92K-example RL-for-reasoning dataset designed to address this gap. It covers six reasoning domains (Math, Code, Science, Logic, Simulation, and Tabular), each with corresponding verifiers. We build the dataset via a careful data-curation pipeline comprising sourcing, deduplication, reward design, and domain-specific and difficulty-based filtering, to facilitate the systematic investigation of cross-domain RL generalization.

Our study using this dataset suggests that a simple mixed-domain RL training approach is effective and reveals several key factors affecting cross-domain transferability. We further train two models, ours-7B and ours-32B, purely with RL on our curated data and observe substantially improved performance over leading open RL reasoning model baselines, with gains of 7.3% and 7.8%, respectively, on an extensive 17-task, six-domain evaluation suite.

We are releasing our dataset, code, and evaluation suite to the community, aiming to support further research and development of more general RL-enhanced reasoning models.