3 papers across 2 sessions
Staggered resets address the harmful nonstationarity that short synchronous rollouts introduce in massively parallel RL: varying the environments' start times improves state coverage, which in turn boosts sample efficiency, performance, and scalability.
A simple, general-purpose off-policy REINFORCE method that outperforms PPO, DPO, and STaR on recent benchmarks.
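
To make the staggered-resets summary above concrete, here is a minimal sketch of the underlying idea, not the paper's implementation. It assumes fixed-length episodes and evenly spaced start offsets; the function names (`rollout_timesteps`, `timestep_coverage`) and all parameter values are hypothetical and chosen only for illustration. It measures how many distinct within-episode timesteps a single short rollout touches when all environments start together versus when their start times are spread out.

```python
import numpy as np


def rollout_timesteps(num_envs, horizon, rollout_len, staggered):
    """Within-episode timestep of every env at every step of one rollout.

    Returns an integer array of shape (rollout_len, num_envs).
    Assumption: episodes have a fixed length `horizon` and wrap around on reset.
    """
    if staggered:
        # Assumed scheme: spread start offsets evenly over the episode.
        start = np.arange(num_envs) * horizon // num_envs
    else:
        # Synchronous resets: every env starts at timestep 0.
        start = np.zeros(num_envs, dtype=np.int64)
    steps = np.arange(rollout_len)[:, None]
    return (start[None, :] + steps) % horizon


def timestep_coverage(num_envs, horizon, rollout_len, staggered):
    """Fraction of the episode's timesteps seen in a single rollout."""
    visited = rollout_timesteps(num_envs, horizon, rollout_len, staggered)
    return np.unique(visited).size / horizon


if __name__ == "__main__":
    kwargs = dict(num_envs=64, horizon=100, rollout_len=8)
    print("synchronous:", timestep_coverage(staggered=False, **kwargs))
    print("staggered:  ", timestep_coverage(staggered=True, **kwargs))
```

With synchronous resets, each rollout batch sees only a narrow slice of the episode, so the data distribution shifts from one update to the next; spreading the offsets lets a single batch span the whole episode. Timestep coverage is used here only as a simple stand-in for state coverage.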