8 papers across 3 sessions
We propose BREAD, a novel and effective variant of GRPO that bridges supervised learning and reinforcement learning by employing branched rollouts from expert traces.
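A minimal sketch of the branched-rollout idea, assuming a setup where a rollout keeps an expert prefix and lets the learner's policy complete it; `policy_sample`, `branched_rollout`, and the uniform stand-in policy are illustrative, not BREAD's actual method:

```python
import random

def policy_sample(prefix, vocab, length):
    """Stand-in learner policy: uniform random continuation (ignores the prefix)."""
    return [random.choice(vocab) for _ in range(length)]

def branched_rollout(expert_trace, vocab, branch_point, horizon):
    """Keep the expert prefix up to branch_point, then branch into policy samples."""
    prefix = expert_trace[:branch_point]                 # expert-anchored prefix
    continuation = policy_sample(prefix, vocab, horizon - branch_point)
    return prefix + continuation                         # mixed expert/policy trace

expert = list("the proof follows by induction")
vocab = list("abcdefghijklmnopqrstuvwxyz ")
print("".join(branched_rollout(expert, vocab, branch_point=10, horizon=len(expert))))
```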
We solve POMDPs by nesting sequential Monte Carlo methods.
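A toy sketch of the nesting idea on a 1-D state-space model: an outer SMC over candidate action sequences, each weighted by an inner SMC (a bootstrap particle filter) over the hidden state. The model, parameters, and names here are hypothetical and only illustrate the general nesting pattern, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def inner_filter(actions, obs, n=100):
    """Inner SMC: bootstrap particle filter returning a log-evidence estimate."""
    x = rng.normal(0.0, 1.0, n)                     # belief particles over hidden state
    loglik = 0.0
    for a, y in zip(actions, obs):
        x = x + a + rng.normal(0.0, 0.1, n)         # toy transition model
        w = np.exp(-0.5 * (y - x) ** 2) + 1e-300    # Gaussian observation likelihood
        loglik += np.log(w.mean())
        x = rng.choice(x, n, p=w / w.sum())         # resample particles
    return loglik

def outer_smc(obs, horizon, k=50):
    """Outer SMC: weight candidate action sequences by inner-filter evidence."""
    plans = rng.normal(0.0, 1.0, (k, horizon))      # candidate action sequences
    logw = np.array([inner_filter(p, obs) for p in plans])
    return plans[np.argmax(logw)]                   # best-weighted plan

print(outer_smc(obs=[0.5, 1.0, 1.4], horizon=3))
```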
We contribute provable guarantees that regularized policy gradient methods converge to approximate Nash equilibria in imperfect-information extensive-form zero-sum games.
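For concreteness, one common instantiation uses entropy regularization; the generic two-player update and the exploitability notion behind an epsilon-approximate Nash equilibrium look like this (illustrative, not the paper's exact theorem or regularizer):

```latex
% Players with policies \pi_\theta, \pi_\phi and player-1 utility u_1 = -u_2;
% \mathcal{H} is policy entropy, \tau the regularization temperature.
\theta_{t+1} = \theta_t + \eta\, \nabla_\theta\!\left(
    \mathbb{E}_{\pi_\theta,\pi_\phi}[u_1] + \tau\, \mathcal{H}(\pi_\theta) \right),
\qquad
\phi_{t+1} = \phi_t - \eta\, \nabla_\phi\!\left(
    \mathbb{E}_{\pi_\theta,\pi_\phi}[u_1] - \tau\, \mathcal{H}(\pi_\phi) \right).

% "Approximate Nash" means bounded exploitability for both players:
\max_{\pi'} u_1(\pi',\pi_\phi) - u_1(\pi_\theta,\pi_\phi) \le \epsilon,
\qquad
\max_{\pi'} u_2(\pi_\theta,\pi') - u_2(\pi_\theta,\pi_\phi) \le \epsilon.
```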
We establish the first sample complexity bounds for private policy optimization.
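The privacy mechanism in such work is typically DP-SGD-style clipping plus Gaussian noise applied to per-trajectory gradients; the sketch below shows that generic pattern under assumed names (`private_policy_gradient_step` and its hyperparameters are hypothetical), not the paper's algorithm:

```python
import numpy as np

def private_policy_gradient_step(theta, per_episode_grads, lr=0.1,
                                 clip=1.0, noise_mult=1.0, rng=None):
    """One DP-style ascent step: clip each episode's gradient, average, add noise."""
    rng = rng or np.random.default_rng()
    clipped = [g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
               for g in per_episode_grads]              # bound per-episode influence
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip / len(clipped), size=theta.shape)
    return theta + lr * (mean_grad + noise)             # noisy gradient ascent

theta = np.zeros(4)
grads = [np.ones(4) * s for s in (0.5, 2.0, -1.0)]      # toy per-episode gradients
print(private_policy_gradient_step(theta, grads))
```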
We propose a diversity-aware policy optimization method for LLM reasoning that introduces a token-level diversity signal focused on positive samples, achieving larger performance gains on mathematical benchmarks while generating more diverse solutions.
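One way to make "token-level diversity on positive samples" concrete is a per-token surprisal bonus computed over the batch of correct solutions, so rarer tokens earn a larger bonus; this is a hypothetical instantiation, not the paper's objective:

```python
from collections import Counter
import math

def diversity_bonuses(positive_samples):
    """Average per-token surprisal of each positive sample across the batch."""
    counts = Counter(tok for s in positive_samples for tok in s)
    total = sum(counts.values())
    bonuses = []
    for s in positive_samples:
        # -log p(token) under the batch token distribution, averaged over tokens
        b = sum(-math.log(counts[t] / total) for t in s) / max(len(s), 1)
        bonuses.append(b)
    return bonuses

positives = [["x", "=", "2"], ["by", "symmetry", "x", "=", "2"]]
print(diversity_bonuses(positives))   # the rarer-token solution scores higher
```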
We propose Differential RL, a physics-informed framework that reformulates RL as a differential control problem. Its algorithm, dfPO, achieves pointwise convergence and outperforms standard RL methods in low-data scientific computing tasks.
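For context, the generic continuous-time control problem that such reformulations build on, together with the pointwise (Hamilton-Jacobi-Bellman) characterization of its value function; this is standard control theory shown for orientation, not dfPO's specific formulation:

```latex
% Dynamics, objective, and value function of a continuous-time control problem:
\dot{x}(t) = f\big(x(t), u(t)\big), \qquad
J(u) = \int_0^T r\big(x(t), u(t)\big)\, dt,

% with the optimal value function V characterized pointwise by the HJB equation:
-\,\partial_t V(x,t) = \max_{u}\Big[\, r(x,u)
    + \nabla_x V(x,t)^{\top} f(x,u) \,\Big].
```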
This study introduces a causality-driven robust optimization approach that selectively updates the model components most sensitive to causal reasoning, improving the model's causal reasoning while preserving valuable pretrained knowledge and mitigating overfitting.
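A hedged sketch of the selective-update idea: rank parameter groups by the magnitude of their gradient under a causal-reasoning loss and update only the most sensitive fraction, freezing the rest; the names and the top-fraction heuristic are assumptions, not the paper's procedure:

```python
import numpy as np

def selective_update(params, grads, top_frac=0.2, lr=0.05):
    """params/grads: dict name -> array; update only the most sensitive groups."""
    sensitivity = {k: np.linalg.norm(g) for k, g in grads.items()}
    k_keep = max(1, int(top_frac * len(params)))
    chosen = set(sorted(sensitivity, key=sensitivity.get, reverse=True)[:k_keep])
    # Gradient step on the chosen groups; all other parameters stay frozen.
    return {k: (p - lr * grads[k] if k in chosen else p)
            for k, p in params.items()}

params = {f"layer{i}": np.ones(3) for i in range(5)}
grads = {f"layer{i}": np.full(3, i * 0.1) for i in range(5)}  # toy sensitivities
print(selective_update(params, grads))  # only the highest-gradient layer moves
```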