Associate Professor, Shanghai Jiao Tong University
3 papers at NeurIPS 2025
A new reasoning paradigm for LLMs that explicitly incorporates meta-thinking, trained with RL in a multi-agent, multi-turn setting
An efficient PbRL (preference-based RL) method that mitigates overfitting and overestimation via dual regularization, improving feedback efficiency in both online and offline settings