PhD student, Nanyang Technological University
2 papers at NeurIPS 2025
To enhance the reliability of the reward model for improving the current policy, we develop the Proximal Policy Exploration (PPE) algorithm, which increases the coverage of the preference buffer in near-policy regions.
We propose MTRec, a novel sequential recommendation framework that uses a learned mental reward model to guide the recommendation model toward users' real preferences.