Adjunct Professor, Université de Montréal
2 papers at NeurIPS 2025
A simple, general-purpose off-policy REINFORCE method that outperforms PPO, DPO, and STaR on recent benchmarks.
We present a theoretical framework for policy convergence in RL, which permits convergence of return distribution estimates.