3 papers across 3 sessions
A simple, general-purpose off-policy REINFORCE method that outperforms PPO, DPO, and STaR on recent benchmarks.
We improve the speed and performance of LLM post-training with a new asynchronous RL approach, leveraging an off-policy objective, a replay buffer, and sampling strategies.
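The off-policy objective and replay buffer mentioned above can be illustrated with a minimal sketch. All names here (`ReplayBuffer`, `off_policy_reinforce_loss`) are hypothetical, and the importance-weighted REINFORCE surrogate is a generic textbook formulation, not the papers' exact method:

```python
import math
import random

# Hypothetical sketch: off-policy REINFORCE with a replay buffer.
# Each stored item is (log-prob under behavior policy mu,
#                      log-prob under current policy pi, reward).

class ReplayBuffer:
    """Fixed-capacity FIFO buffer of past samples."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def add(self, item):
        self.items.append(item)
        if len(self.items) > self.capacity:
            self.items.pop(0)  # evict the oldest sample

    def sample(self, k):
        return random.sample(self.items, min(k, len(self.items)))

def off_policy_reinforce_loss(batch, baseline=0.0):
    """Importance-weighted REINFORCE surrogate over stale samples."""
    total = 0.0
    for logp_mu, logp_pi, reward in batch:
        w = math.exp(logp_pi - logp_mu)  # importance ratio pi/mu
        total += -w * (reward - baseline) * logp_pi
    return total / len(batch)
```

Usage: fill the buffer from an asynchronous generator, then compute the loss on sampled minibatches while the current policy continues training, e.g. `off_policy_reinforce_loss(buf.sample(32))`.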