Poster Session 4 · Thursday, December 4, 2025 · 4:30 PM – 7:30 PM
#304
Policy Gradient Methods Converge Globally in Imperfect-Information Extensive-Form Games
Abstract
Multi-agent reinforcement learning (MARL) has long been seen as inseparable from Markov games (Littman 1994). Yet, the most remarkable achievements of practical MARL have arguably been in extensive-form games (EFGs)---spanning games like poker, Stratego, and Hanabi. At the same time, little is known about provable equilibrium convergence for MARL algorithms applied to EFGs, since such algorithms run up against the inherent nonconvexity of the optimization landscape and the failure of the value-iteration subroutine in this setting.
To this end, we leverage contemporary advances in nonconvex optimization theory to prove that regularized alternating policy gradient with
- direct policy parametrization,
- softmax policy parametrization, and
- softmax policy parametrization with natural policy gradient

all converge globally to a Nash equilibrium (NE) of the game.
We exploit the structure of the regularized game to further prove that the regularized utility satisfies the much stronger proximal Polyak-Łojasiewicz (PL) condition. In turn, we show that the different flavors of alternating policy gradient methods converge to an ε-approximate NE with a number of iterations and trajectory samples that are polynomial in 1/ε and the natural parameters of the game. Our work is a preliminary---yet principled---attempt at bridging the conceptual gap between the theory of Markov games and imperfect-information EFGs, and it aspires to stimulate a deeper dialogue between the two.
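To convey the flavor of the algorithm family the abstract describes, here is a minimal sketch (not the authors' exact method): entropy-regularized alternating policy gradient with softmax parametrization, run on a two-player zero-sum matrix game as a one-shot stand-in for the extensive-form setting. The regularization temperature `tau`, step size `eta`, and rock-paper-scissors payoff matrix are illustrative choices, not taken from the paper.

```python
# Hedged sketch: entropy-regularized alternating policy gradient with
# softmax parametrization on a zero-sum matrix game. All hyperparameters
# below are illustrative assumptions, not the paper's.
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def softmax_jacobian(x):
    # d softmax(theta) / d theta = diag(x) - x x^T
    return np.diag(x) - np.outer(x, x)

def alternating_pg(A, tau=0.2, eta=0.1, iters=5000, seed=0):
    """Alternate entropy-regularized policy-gradient ascent steps.

    Player 1 maximizes x^T A y + tau*H(x); player 2 maximizes
    -x^T A y + tau*H(y), where H is Shannon entropy and the policies
    are x = softmax(theta1), y = softmax(theta2).
    """
    rng = np.random.default_rng(seed)
    th1 = rng.normal(size=A.shape[0])
    th2 = rng.normal(size=A.shape[1])
    for _ in range(iters):
        x, y = softmax(th1), softmax(th2)
        # Gradient wrt th1 of x^T A y + tau*H(x); constants vanish
        # because the softmax Jacobian annihilates the all-ones vector.
        th1 = th1 + eta * softmax_jacobian(x) @ (A @ y - tau * np.log(x))
        x = softmax(th1)  # alternating: player 2 sees the updated x
        th2 = th2 + eta * softmax_jacobian(y) @ (-A.T @ x - tau * np.log(y))
    return softmax(th1), softmax(th2)

if __name__ == "__main__":
    # Rock-paper-scissors: by symmetry, the entropy-regularized
    # equilibrium is the uniform policy for both players.
    A = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
    x, y = alternating_pg(A)
    print(x, y)
```

The regularization is what makes the dynamics converge here: without the `tau * log` damping terms, simultaneous or alternating gradient play on rock-paper-scissors cycles around the equilibrium instead of approaching it.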