Poster Session 4 · Thursday, December 4, 2025 · 4:30 PM – 7:30 PM
#304
Policy Gradient Methods Converge Globally in Imperfect-Information Extensive-Form Games
Abstract
Multi-agent reinforcement learning (MARL) has long been seen as inseparable from Markov games (Littman 1994). Yet, the most remarkable achievements of practical MARL have arguably been in extensive-form games (EFGs)---spanning games like poker, Stratego, and Hanabi. At the same time, little is known about provable equilibrium convergence for MARL algorithms applied to EFGs, since such algorithms run up against the inherent nonconvexity of the optimization landscape and the failure of the value-iteration subroutine in this setting.
To this end, we leverage contemporary advances in nonconvex optimization theory to prove that regularized alternating policy gradient with
- direct policy parametrization,
- softmax policy parametrization, and
- softmax policy parametrization with natural policy gradient

all converge globally to a Nash equilibrium (NE) of the game.
We exploit the structure of the regularized game to further prove that the regularized utility satisfies the much stronger proximal Polyak-Łojasiewicz (PL) condition. In turn, we show that the different flavors of alternating policy gradient methods converge to an ε-approximate NE with a number of iterations and trajectory samples that are polynomial in 1/ε and the natural parameters of the game. Our work is a preliminary---yet principled---attempt at bridging the conceptual gap between the theory of Markov games and imperfect-information EFGs, and it aspires to stimulate a deeper dialogue between the two.
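To convey the flavor of the algorithm family the abstract describes, here is a minimal sketch (not the authors' exact method): entropy-regularized alternating policy gradient with softmax parametrization, run on a two-player zero-sum matrix game as a one-shot stand-in for the extensive-form setting. The regularization temperature `tau`, step size `eta`, and rock-paper-scissors payoff matrix are illustrative choices, not taken from the paper.

```python
# Hedged sketch: entropy-regularized alternating policy gradient with
# softmax parametrization on a zero-sum matrix game. All hyperparameters
# below are illustrative assumptions, not the paper's.
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def softmax_jacobian(x):
    # d softmax(theta) / d theta = diag(x) - x x^T
    return np.diag(x) - np.outer(x, x)

def alternating_pg(A, tau=0.2, eta=0.1, iters=5000, seed=0):
    """Alternate entropy-regularized policy-gradient ascent steps.

    Player 1 maximizes x^T A y + tau*H(x); player 2 maximizes
    -x^T A y + tau*H(y), where H is Shannon entropy and the policies
    are x = softmax(theta1), y = softmax(theta2).
    """
    rng = np.random.default_rng(seed)
    th1 = rng.normal(size=A.shape[0])
    th2 = rng.normal(size=A.shape[1])
    for _ in range(iters):
        x, y = softmax(th1), softmax(th2)
        # Gradient wrt th1 of x^T A y + tau*H(x); constants vanish
        # because the softmax Jacobian annihilates the all-ones vector.
        th1 = th1 + eta * softmax_jacobian(x) @ (A @ y - tau * np.log(x))
        x = softmax(th1)  # alternating: player 2 sees the updated x
        th2 = th2 + eta * softmax_jacobian(y) @ (-A.T @ x - tau * np.log(y))
    return softmax(th1), softmax(th2)

if __name__ == "__main__":
    # Rock-paper-scissors: by symmetry, the entropy-regularized
    # equilibrium is the uniform policy for both players.
    A = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
    x, y = alternating_pg(A)
    print(x, y)
```

The regularization is what makes the dynamics converge here: without the `tau * log` damping terms, simultaneous or alternating gradient play on rock-paper-scissors cycles around the equilibrium instead of approaching it.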