8 papers across 3 sessions
We propose BREAD, a novel and effective variant of GRPO that bridges supervised learning and reinforcement learning by employing branched rollouts from expert traces.
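A minimal sketch of the branched-rollout idea, assuming a setup where a rollout keeps an expert prefix and lets the learner's policy complete it; `policy_sample`, `branched_rollout`, and the uniform stand-in policy are illustrative, not BREAD's actual method:

```python
import random

def policy_sample(prefix, vocab, length):
    """Stand-in learner policy: uniform random continuation (ignores the prefix)."""
    return [random.choice(vocab) for _ in range(length)]

def branched_rollout(expert_trace, vocab, branch_point, horizon):
    """Keep the expert prefix up to branch_point, then branch into policy samples."""
    prefix = expert_trace[:branch_point]                 # expert-anchored prefix
    continuation = policy_sample(prefix, vocab, horizon - branch_point)
    return prefix + continuation                         # mixed expert/policy trace

expert = list("the proof follows by induction")
vocab = list("abcdefghijklmnopqrstuvwxyz ")
print("".join(branched_rollout(expert, vocab, branch_point=10, horizon=len(expert))))
```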
We solve POMDPs by nesting sequential Monte Carlo methods.
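A toy sketch of the nesting idea on a 1-D state-space model: an outer SMC over candidate action sequences, each weighted by an inner SMC (a bootstrap particle filter) over the hidden state. The model, parameters, and names here are hypothetical and only illustrate the general nesting pattern, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def inner_filter(actions, obs, n=100):
    """Inner SMC: bootstrap particle filter returning a log-evidence estimate."""
    x = rng.normal(0.0, 1.0, n)                     # belief particles over hidden state
    loglik = 0.0
    for a, y in zip(actions, obs):
        x = x + a + rng.normal(0.0, 0.1, n)         # toy transition model
        w = np.exp(-0.5 * (y - x) ** 2) + 1e-300    # Gaussian observation likelihood
        loglik += np.log(w.mean())
        x = rng.choice(x, n, p=w / w.sum())         # resample particles
    return loglik

def outer_smc(obs, horizon, k=50):
    """Outer SMC: weight candidate action sequences by inner-filter evidence."""
    plans = rng.normal(0.0, 1.0, (k, horizon))      # candidate action sequences
    logw = np.array([inner_filter(p, obs) for p in plans])
    return plans[np.argmax(logw)]                   # best-weighted plan

print(outer_smc(obs=[0.5, 1.0, 1.4], horizon=3))
```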
We contribute provable guarantees that regularized policy gradient methods converge to approximate Nash equilibria in imperfect-information extensive-form zero-sum games.
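For concreteness, one common instantiation uses entropy regularization; the generic two-player update and the exploitability notion behind an epsilon-approximate Nash equilibrium look like this (illustrative, not the paper's exact theorem or regularizer):

```latex
% Players with policies \pi_\theta, \pi_\phi and player-1 utility u_1 = -u_2;
% \mathcal{H} is policy entropy, \tau the regularization temperature.
\theta_{t+1} = \theta_t + \eta\, \nabla_\theta\!\left(
    \mathbb{E}_{\pi_\theta,\pi_\phi}[u_1] + \tau\, \mathcal{H}(\pi_\theta) \right),
\qquad
\phi_{t+1} = \phi_t - \eta\, \nabla_\phi\!\left(
    \mathbb{E}_{\pi_\theta,\pi_\phi}[u_1] - \tau\, \mathcal{H}(\pi_\phi) \right).

% "Approximate Nash" means bounded exploitability for both players:
\max_{\pi'} u_1(\pi',\pi_\phi) - u_1(\pi_\theta,\pi_\phi) \le \epsilon,
\qquad
\max_{\pi'} u_2(\pi_\theta,\pi') - u_2(\pi_\theta,\pi_\phi) \le \epsilon.
```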
We establish the first sample complexity bounds for private policy optimization.
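The privacy mechanism in such work is typically DP-SGD-style clipping plus Gaussian noise applied to per-trajectory gradients; the sketch below shows that generic pattern under assumed names (`private_policy_gradient_step` and its hyperparameters are hypothetical), not the paper's algorithm:

```python
import numpy as np

def private_policy_gradient_step(theta, per_episode_grads, lr=0.1,
                                 clip=1.0, noise_mult=1.0, rng=None):
    """One DP-style ascent step: clip each episode's gradient, average, add noise."""
    rng = rng or np.random.default_rng()
    clipped = [g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
               for g in per_episode_grads]              # bound per-episode influence
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip / len(clipped), size=theta.shape)
    return theta + lr * (mean_grad + noise)             # noisy gradient ascent

theta = np.zeros(4)
grads = [np.ones(4) * s for s in (0.5, 2.0, -1.0)]      # toy per-episode gradients
print(private_policy_gradient_step(theta, grads))
```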
We propose a diversity-aware policy optimization method for LLM reasoning that introduces a token-level diversity signal focused on positive samples, achieving larger performance gains on mathematical benchmarks while generating more diverse solutions.
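One way to make "token-level diversity on positive samples" concrete is a per-token surprisal bonus computed over the batch of correct solutions, so rarer tokens earn a larger bonus; this is a hypothetical instantiation, not the paper's objective:

```python
from collections import Counter
import math

def diversity_bonuses(positive_samples):
    """Average per-token surprisal of each positive sample across the batch."""
    counts = Counter(tok for s in positive_samples for tok in s)
    total = sum(counts.values())
    bonuses = []
    for s in positive_samples:
        # -log p(token) under the batch token distribution, averaged over tokens
        b = sum(-math.log(counts[t] / total) for t in s) / max(len(s), 1)
        bonuses.append(b)
    return bonuses

positives = [["x", "=", "2"], ["by", "symmetry", "x", "=", "2"]]
print(diversity_bonuses(positives))   # the rarer-token solution scores higher
```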
We propose Differential RL, a physics-informed framework that reformulates RL as a differential control problem. Its algorithm, dfPO, achieves pointwise convergence and outperforms standard RL methods in low-data scientific computing tasks.
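For context, the generic continuous-time control problem that such reformulations build on, together with the pointwise (Hamilton-Jacobi-Bellman) characterization of its value function; this is standard control theory shown for orientation, not dfPO's specific formulation:

```latex
% Dynamics, objective, and value function of a continuous-time control problem:
\dot{x}(t) = f\big(x(t), u(t)\big), \qquad
J(u) = \int_0^T r\big(x(t), u(t)\big)\, dt,

% with the optimal value function V characterized pointwise by the HJB equation:
-\,\partial_t V(x,t) = \max_{u}\Big[\, r(x,u)
    + \nabla_x V(x,t)^{\top} f(x,u) \,\Big].
```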
This study introduces a causality-driven robust optimization approach that selectively updates the model components most sensitive to causal reasoning, improving the model's causal reasoning while preserving valuable pretrained knowledge and mitigating overfitting.
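A hedged sketch of the selective-update idea: rank parameter groups by the magnitude of their gradient under a causal-reasoning loss and update only the most sensitive fraction, freezing the rest; the names and the top-fraction heuristic are assumptions, not the paper's procedure:

```python
import numpy as np

def selective_update(params, grads, top_frac=0.2, lr=0.05):
    """params/grads: dict name -> array; update only the most sensitive groups."""
    sensitivity = {k: np.linalg.norm(g) for k, g in grads.items()}
    k_keep = max(1, int(top_frac * len(params)))
    chosen = set(sorted(sensitivity, key=sensitivity.get, reverse=True)[:k_keep])
    # Gradient step on the chosen groups; all other parameters stay frozen.
    return {k: (p - lr * grads[k] if k in chosen else p)
            for k, p in params.items()}

params = {f"layer{i}": np.ones(3) for i in range(5)}
grads = {f"layer{i}": np.full(3, i * 0.1) for i in range(5)}  # toy sensitivities
print(selective_update(params, grads))  # only the highest-gradient layer moves
```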