MS student, Université de Montréal, Mila - Quebec AI Institute
1 paper at NeurIPS 2025
A simple, general-purpose off-policy REINFORCE method that outperforms PPO, DPO, and STaR on recent benchmarks.