On the Convergence of Single-Timescale Actor-Critic

Navdeep Kumar, Priyank Agrawal, Giorgia Ramponi, Kfir Yehuda Levy, Shie Mannor

Technion · Columbia University · University of Zurich

Policy Gradient · Reinforcement Learning · Actor-Critic Algorithms · Sample Complexity · Convergence

Abstract

We analyze the global convergence of the single-timescale actor-critic (AC) algorithm for infinite-horizon discounted Markov decision processes (MDPs) with finite state spaces.

To this end, we introduce an elegant analytical framework for handling complex, coupled recursions inherent in the algorithm.

Leveraging this framework, we establish that the algorithm converges to an $\epsilon$-close globally optimal policy with a sample complexity of $O(\epsilon^{-3})$. This significantly improves upon the existing complexity of $O(\epsilon^{-2})$ to achieve an $\epsilon$-close stationary policy, which is equivalent to a complexity of $O(\epsilon^{-4})$ to achieve an $\epsilon$-close globally optimal policy using the gradient domination lemma. Furthermore, we demonstrate that to achieve this improvement, the step sizes for both the actor and critic must decay as $O(k^{-2/3})$ with iteration $k$, diverging from the conventional $O(k^{-1/2})$ rates commonly used in (non)convex optimization.
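To make the single-timescale structure concrete, the following is a minimal illustrative sketch, not the exact algorithm analyzed in the paper. It assumes a tabular softmax policy, a TD(0) critic on state-action values, and sampling from a trajectory that restarts from the initial distribution with probability $1-\gamma$. The names `step_size`, `single_timescale_actor_critic`, the constant `c`, and the omission of the $1/(1-\gamma)$ factor in the actor step (absorbed into `c`) are all assumptions made for illustration. The only feature taken from the abstract is that the actor and critic share one step size decaying as $k^{-2/3}$.

```python
import math
import random


def step_size(k, c=0.5):
    """Step size c * k^(-2/3), shared by actor and critic (single timescale)."""
    return c * k ** (-2.0 / 3.0)


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, rng):
    """Draw an index from a discrete distribution given as a list."""
    return rng.choices(list(range(len(probs))), weights=probs, k=1)[0]


def single_timescale_actor_critic(P, R, gamma, rho, num_iters, c=0.5, seed=0):
    """Hypothetical tabular sketch.

    P[s][a] is the next-state distribution, R[s][a] the reward,
    rho the initial state distribution. Returns (policy, Q).
    """
    rng = random.Random(seed)
    n_states, n_actions = len(R), len(R[0])
    theta = [[0.0] * n_actions for _ in range(n_states)]  # actor parameters
    Q = [[0.0] * n_actions for _ in range(n_states)]      # critic estimates
    s = sample(rho, rng)
    for k in range(1, num_iters + 1):
        lr = step_size(k, c)  # same step size for both updates
        pi_s = softmax(theta[s])
        a = sample(pi_s, rng)
        s_next = sample(P[s][a], rng)
        a_next = sample(softmax(theta[s_next]), rng)
        # Advantage estimate from the current critic: Q(s, a) - V(s).
        v_s = sum(p * q for p, q in zip(pi_s, Q[s]))
        adv = Q[s][a] - v_s
        # Critic: TD(0) step toward r + gamma * Q(s', a').
        td_error = R[s][a] + gamma * Q[s_next][a_next] - Q[s][a]
        Q[s][a] += lr * td_error
        # Actor: stochastic policy-gradient step; for softmax,
        # the gradient of log pi(a|s) w.r.t. theta[s] is onehot(a) - pi(.|s).
        for b in range(n_actions):
            grad_log = (1.0 if b == a else 0.0) - pi_s[b]
            theta[s][b] += lr * adv * grad_log
        # Continue with probability gamma, otherwise restart from rho.
        s = s_next if rng.random() < gamma else sample(rho, rng)
    policy = [softmax(row) for row in theta]
    return policy, Q


if __name__ == "__main__":
    # Toy two-state MDP (made up for illustration): action a moves to state a,
    # and action 0 pays reward 1 while action 1 pays 0.
    P = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
    R = [[1.0, 0.0], [1.0, 0.0]]
    policy, Q = single_timescale_actor_critic(P, R, 0.9, [0.5, 0.5], 5000)
    print(policy)
```

Replacing the exponent in `step_size` with $-1/2$ gives the conventional schedule that the abstract contrasts against. The sketch says nothing about the resulting rates, which are the subject of the analysis.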