1 paper across 1 session
A simple, general-purpose off-policy REINFORCE method that outperforms PPO, DPO and STaR on recent benchmarks.