Poster Session 4 · Thursday, December 4, 2025 4:30 PM → 7:30 PM
#3209
Global Convergence for Average Reward Constrained MDPs with Primal-Dual Actor Critic Algorithm
Abstract
This paper investigates infinite-horizon average reward Constrained Markov Decision Processes (CMDPs) under general parametrized policies with smooth and bounded policy gradients.
We propose a Primal-Dual Natural Actor-Critic algorithm that handles the constraints while ensuring a fast convergence rate.
In particular, our algorithm achieves global convergence and constraint violation rates of $\tilde{\mathcal{O}}(1/\sqrt{T})$ over a horizon of length $T$ when the mixing time, $\tau_{\mathrm{mix}}$, is known to the learner. In the absence of knowledge of $\tau_{\mathrm{mix}}$, the achievable rates change to $\tilde{\mathcal{O}}(1/T^{0.5-\epsilon})$, provided that $T \geq \tilde{\mathcal{O}}(\tau_{\mathrm{mix}}^{2/\epsilon})$.
Our results match the theoretical lower bound for Markov Decision Processes and establish a new benchmark in the theoretical exploration of average reward CMDPs.
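To illustrate the primal-dual scheme underlying the abstract, the following is a minimal sketch on a toy single-state CMDP (a constrained bandit). The arm rewards and costs, step sizes, and exact-expectation updates are all illustrative assumptions, not the paper's algorithm, which uses sampled actor-critic estimates under general policy parametrization.

```python
import numpy as np

# Toy single-state CMDP (a constrained bandit): maximize average reward
# subject to an average-cost budget. All numbers are illustrative.
rewards = np.array([1.0, 0.6, 0.2])  # expected reward of each arm
costs = np.array([1.0, 0.4, 0.1])    # expected cost of each arm
budget = 0.5                          # constraint: average cost <= budget

theta = np.zeros(3)  # softmax policy logits (primal variable)
lam = 0.0            # Lagrange multiplier (dual variable)
eta_theta, eta_lam = 0.1, 0.1
T = 5000
avg_pi = np.zeros(3)  # running average of the policy (Cesaro iterate)

for t in range(T):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    avg_pi += pi / T
    # Lagrangian value of each arm: r(a) - lam * c(a)
    lagr = rewards - lam * costs
    # Primal ascent: for a tabular softmax policy, an NPG-style step
    # adds the (Lagrangian) advantage directly to the logits.
    theta += eta_theta * (lagr - pi @ lagr)
    # Dual descent on the constraint violation, projected onto lam >= 0
    lam = max(0.0, lam + eta_lam * (pi @ costs - budget))

print(f"avg cost {avg_pi @ costs:.3f} (budget {budget}), "
      f"avg reward {avg_pi @ rewards:.3f}")
```

The time-averaged policy is the object to which $\tilde{\mathcal{O}}(1/\sqrt{T})$-type guarantees are typically stated: its average cost hovers near the budget (complementary slackness keeps the constraint nearly tight), while its reward approaches that of the best feasible mixture of arms.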