PhD student, Automation, Tsinghua University
3 papers at NeurIPS 2025
We systematically examine the current state of RLVR and surprisingly find that it does not elicit fundamentally new reasoning patterns—revealing a gap between the potential of RL and the actual impact of current RLVR methods.
Self-play reasoning RL trained with no data can achieve SOTA performance against RL models trained with human data.