ppo - NeurIPS 2025

ppo

3 papers across 2 sessions

Poster Session 2

Wednesday, December 3, 2025 · 4:30 PM → 7:30 PM

Staggered Environment Resets Improve Massively Parallel On-Policy Reinforcement Learning

#310 · Sid Bharthulwar, Stone Tao, Hao Su

Short synchronous rollouts in massively parallel RL introduce harmful nonstationarity. Staggered resets address this by varying environment start times, which improves state coverage, sample efficiency, performance, and scalability.
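
The coverage effect can be illustrated with a toy model. This is a minimal sketch, not the authors' implementation: the helper names (`initial_offsets`, `rollout_timesteps`), the evenly spaced offset rule, and the sizes (8 environments, horizon 64, rollout length 4) are all assumptions made for illustration. It counts how many distinct episode timesteps appear in each rollout batch when every environment starts at step 0 versus when starts are spread across the episode.

```python
def initial_offsets(num_envs, horizon, staggered):
    """Episode timestep at which each environment begins (hypothetical helper)."""
    if not staggered:
        return [0] * num_envs
    # Assumed rule: spread the starting timesteps evenly across the episode.
    return [(i * horizon) // num_envs for i in range(num_envs)]


def rollout_timesteps(offsets, horizon, rollout_len, num_rollouts):
    """Return, for each rollout, the set of episode timesteps seen in its batch."""
    steps = list(offsets)
    coverage = []
    for _ in range(num_rollouts):
        seen = set()
        for _ in range(rollout_len):
            for i, t in enumerate(steps):
                seen.add(t)
                # Each environment wraps to step 0 when its episode ends.
                steps[i] = (t + 1) % horizon
        coverage.append(seen)
    return coverage


sync = rollout_timesteps(initial_offsets(8, 64, False), 64, 4, 3)
stag = rollout_timesteps(initial_offsets(8, 64, True), 64, 4, 3)
print([len(s) for s in sync])  # distinct timesteps per batch, synchronous starts
print([len(s) for s in stag])  # distinct timesteps per batch, staggered starts
```

In this toy setting each synchronous batch covers only one narrow, shifting window of the episode, while each staggered batch samples timesteps from across the whole horizon.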

Poster Session 5

2 papers

Friday, December 5, 2025 · 11:00 AM → 2:00 PM

Exhibit Hall C,D,E

Tapered Off-Policy REINFORCE - Stable and efficient reinforcement learning for large language models

#206 · Nicolas Le Roux, Marc Bellemare, Jonathan Lebensold, Arnaud Bergeron, Joshua Greaves, Alexandre Fréchette, Carolyne Pelletier, Eric Thibodeau-Laufer, Sándor Tóth, Sam Work

A simple, general-purpose off-policy REINFORCE method that outperforms PPO, DPO, and STaR on recent benchmarks.

Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?

#1206 · Paul Gölz, Nika Haghtalab, Kunhe Yang