1 paper across 1 session
We improve the sample complexity of single-timescale actor-critic from $O(\epsilon^{-4})$ to $O(\epsilon^{-3})$ for obtaining an $\epsilon$-close globally optimal policy.
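As a rough illustration of the size of this improvement, the ratio of the two bounds can be worked out directly. This sketch assumes the hidden constants and any logarithmic factors are comparable in both bounds, which the summary above does not state, and the value $\epsilon = 10^{-2}$ is chosen only as an example:

$$
\frac{\epsilon^{-4}}{\epsilon^{-3}} = \epsilon^{-1},
\qquad
\epsilon = 10^{-2}:\quad \epsilon^{-4} = 10^{8},\quad \epsilon^{-3} = 10^{6},\quad \frac{10^{8}}{10^{6}} = 10^{2}.
$$

Under these assumptions, the new bound saves a factor of $1/\epsilon$ in samples, and the saving grows as the target accuracy $\epsilon$ shrinks.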