We improve the speed and performance of LLM post-training with a new asynchronous RL approach that combines an off-policy objective, a replay buffer, and targeted sampling strategies.
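The ingredients named above can be sketched in a toy form. The snippet below is a minimal illustration, not the paper's actual method: `ReplayBuffer`, `off_policy_loss`, the REINFORCE-style importance-weighted objective, and the actor/learner threading setup are all hypothetical stand-ins chosen to show how asynchronous generation, buffered rollouts, and an off-policy correction fit together.

```python
import math
import random
import threading

class ReplayBuffer:
    """Thread-safe fixed-capacity buffer; the oldest rollouts are evicted first."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = []
        self._lock = threading.Lock()

    def add(self, item):
        with self._lock:
            self._items.append(item)
            if len(self._items) > self.capacity:
                self._items.pop(0)  # drop the stalest rollout

    def sample(self, k: int):
        with self._lock:
            return random.sample(self._items, min(k, len(self._items)))

def off_policy_loss(batch, current_logprob: float) -> float:
    """Importance-weighted REINFORCE-style loss: the ratio corrects for the
    gap between the behavior policy that generated each rollout and the
    current policy being trained (a stand-in for the paper's objective)."""
    total = 0.0
    for logp_behavior, reward in batch:
        ratio = math.exp(current_logprob - logp_behavior)  # pi_current / pi_behavior
        total += -ratio * reward
    return total / len(batch)

# Asynchronous usage: an actor thread fills the buffer with rollouts while
# the learner samples possibly stale data and applies the off-policy correction.
buffer = ReplayBuffer(capacity=64)

def actor():
    for _ in range(100):
        logp = math.log(random.uniform(0.1, 0.9))   # behavior-policy log-prob
        reward = random.uniform(0.0, 1.0)           # toy scalar reward
        buffer.add((logp, reward))

t = threading.Thread(target=actor)
t.start()
t.join()
batch = buffer.sample(16)
loss = off_policy_loss(batch, current_logprob=math.log(0.5))
```

Because generation and learning are decoupled, the learner never waits for fresh on-policy rollouts; the importance ratio is what keeps the gradient estimate valid as the buffer's contents grow stale.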