PhD student, Université de Montréal
1 paper at NeurIPS 2025
We improve the speed and performance of LLM post-training with a new asynchronous RL approach that combines an off-policy objective, a replay buffer, and dedicated sampling strategies.
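The blurb does not specify the exact objective, so the following is only an illustrative sketch of the generic ingredients it names: a replay buffer that lets a learner reuse rollouts produced asynchronously by stale policies, paired with a truncated importance-sampling correction so the off-policy updates remain stable. All names (`ReplayBuffer`, `off_policy_loss`, the `clip` threshold) are hypothetical, not from the paper.

```python
import math
import random
from collections import deque

class ReplayBuffer:
    """FIFO buffer of (behavior log-prob, reward) pairs from async generation.

    In an asynchronous setup, generator workers keep producing rollouts with a
    slightly stale policy while the learner trains; the buffer decouples the two.
    """

    def __init__(self, capacity: int):
        self.data = deque(maxlen=capacity)  # oldest rollouts are evicted first

    def add(self, logp_behavior: float, reward: float) -> None:
        self.data.append((logp_behavior, reward))

    def sample(self, k: int):
        # Uniform sampling; a real system might prioritize fresher rollouts.
        return random.sample(self.data, min(k, len(self.data)))

def off_policy_loss(batch, logp_current, baseline: float, clip: float = 2.0) -> float:
    """Importance-weighted REINFORCE-style loss over an off-policy batch.

    The ratio exp(logp_current - logp_behavior) reweights stale samples toward
    the current policy; truncating it at `clip` bounds the variance.
    """
    total = 0.0
    for (logp_b, reward), logp_c in zip(batch, logp_current):
        ratio = min(math.exp(logp_c - logp_b), clip)  # truncated IS ratio
        total += -ratio * (reward - baseline)         # policy-gradient surrogate
    return total / len(batch)
```

In use, generator processes would call `add` as rollouts complete, while the learner repeatedly samples a batch and steps on `off_policy_loss`; the clip keeps a single very-stale sample from dominating an update.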