PhD student, Department of Automation, Tsinghua University
2 papers at NeurIPS 2025
Self-play reasoning RL with no data can achieve SOTA results compared with RL models trained on human data