PhD student, Nanjing University
3 papers at NeurIPS 2025
We investigate last-iterate convergence of Regret Matching$^+$ variants in games satisfying the weak Minty variational inequality.
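For context, a standard statement of the weak Minty variational inequality from the literature (a sketch; the operator $F$, solution point $z^\*$, and parameter $\rho$ are not defined in this note and are assumptions here):

```latex
% Weak Minty variational inequality (weak MVI), standard form:
% there exist a point z* and a parameter rho (possibly negative)
% such that for all z in the feasible set Z,
\exists z^\* \in Z,\ \rho \in \mathbb{R}:\quad
\langle F(z),\, z - z^\* \rangle \;\ge\; \rho \,\| F(z) \|^2
\qquad \forall z \in Z.
```

With $\rho = 0$ this reduces to the classical Minty condition; allowing $\rho < 0$ is what makes the condition "weak" and covers some nonmonotone games.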
To improve the reliability of the reward model for current-policy improvement, we develop the Proximal Policy Exploration (PPE) algorithm, which increases the coverage of the preference buffer in regions close to the near-policy distribution.
We present the first parameter-free last-iterate convergence results for Counterfactual Regret Minimization (CFR) algorithms.