Poster Session 4 · Thursday, December 4, 2025 4:30 PM → 7:30 PM
#402
Estimation and Inference in Distributional Reinforcement Learning
Liangyu Zhang, Yang Peng, Jiadong Liang, Wenhao Yang, Zhihua Zhang
NeurIPS
Abstract
In this paper, we study distributional reinforcement learning from the perspective of statistical efficiency.
We investigate distributional policy evaluation, aiming to estimate the complete distribution of the random return (denoted $\eta^\pi$) attained by a given policy $\pi$. We use the certainty-equivalence method to construct our estimator $\hat\eta^\pi$, based on a generative model. In this circumstance we need a dataset of size $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^{2p}(1-\gamma)^{2p+2}}\right)$ to guarantee that the supremum $p$-Wasserstein metric between $\hat\eta^\pi$ and $\eta^\pi$ is less than $\epsilon$ with high probability. This implies the distributional policy evaluation problem can be solved with sample efficiency.
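To make the estimator concrete, here is a minimal Python sketch (not from the paper) of certainty-equivalence distributional policy evaluation in a tabular MDP: the generative model is queried n_samples times per state-action pair to form an empirical transition kernel, and the return distribution of the resulting plug-in MDP is approximated by Monte Carlo rollouts. All names (sample_next_state, rewards, n_rollouts, ...) are hypothetical, and deterministic known rewards are assumed to keep the sketch short.

import numpy as np

def certainty_equivalence_returns(sample_next_state, rewards, policy,
                                  n_states, n_actions, gamma,
                                  n_samples=1000, n_rollouts=5000,
                                  horizon=200, rng=None):
    """Certainty-equivalence estimate of the return distribution eta^pi.

    sample_next_state(s, a) -> next state drawn from the generative model.
    rewards[s, a]           -> known deterministic reward (an assumption
                               made to keep this sketch short).
    policy[s]               -> action chosen by the evaluated policy pi.
    Returns, for each initial state, an array of Monte Carlo returns whose
    empirical law approximates the return distribution of the *empirical* MDP.
    """
    rng = np.random.default_rng(rng)
    # Step 1: build the empirical transition kernel P_hat from
    # n_samples generative-model draws per state-action pair.
    P_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                P_hat[s, a, sample_next_state(s, a)] += 1
    P_hat /= n_samples
    # Step 2: approximate the return distribution of the plug-in MDP
    # by rolling out pi inside the empirical model.
    returns = np.zeros((n_states, n_rollouts))
    for s0 in range(n_states):
        for i in range(n_rollouts):
            s, g, disc = s0, 0.0, 1.0
            # Truncate at `horizon`; the truncation error is O(gamma^horizon).
            for _ in range(horizon):
                a = policy[s]
                g += disc * rewards[s, a]
                disc *= gamma
                s = rng.choice(n_states, p=P_hat[s, a])
            returns[s0, i] = g
    return returns

The empirical law of returns[s0] then plays the role of $\hat\eta^\pi(s_0)$, and the supremum $p$-Wasserstein distance between this law and the true return law is the quantity the bound above controls.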
Also, we show that under different mild assumptions a dataset of size $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^{2}(1-\gamma)^{4}}\right)$ suffices to ensure that the supremum Kolmogorov-Smirnov metric and the supremum total variation metric between $\hat\eta^\pi$ and $\eta^\pi$ are below $\epsilon$ with high probability.
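For reference, the supremum metrics appearing in these guarantees can be written as follows (a sketch using standard definitions; the notation here is ours, not necessarily the paper's):

\[
  \overline{W}_p(\hat\eta^\pi, \eta^\pi)
    = \sup_{s \in \mathcal{S}} W_p\bigl(\hat\eta^\pi(s), \eta^\pi(s)\bigr), \qquad
  \overline{\mathrm{KS}}(\hat\eta^\pi, \eta^\pi)
    = \sup_{s \in \mathcal{S}} \sup_{x \in \mathbb{R}}
      \bigl| \hat F_s(x) - F_s(x) \bigr|,
\]
\[
  \overline{\mathrm{TV}}(\hat\eta^\pi, \eta^\pi)
    = \sup_{s \in \mathcal{S}} \sup_{B \in \mathcal{B}(\mathbb{R})}
      \bigl| \hat\eta^\pi(s)(B) - \eta^\pi(s)(B) \bigr|,
\]

where $\hat F_s$ and $F_s$ denote the CDFs of $\hat\eta^\pi(s)$ and $\eta^\pi(s)$, and $\mathcal{B}(\mathbb{R})$ is the Borel $\sigma$-algebra.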
Furthermore, we investigate the asymptotic behavior of $\hat\eta^\pi$. We demonstrate that the "empirical process" $\sqrt{n}\,(\hat\eta^\pi - \eta^\pi)$ converges weakly to a Gaussian process in the space of bounded functionals on a Lipschitz function class $\mathcal{F}_{W_1}$, and also in the spaces of bounded functionals on an indicator function class $\mathcal{F}_{\mathrm{KS}}$ and a bounded measurable function class $\mathcal{F}_{\mathrm{TV}}$, when some mild conditions hold.
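In symbols, the weak-convergence statement has the form of a functional central limit theorem (a sketch in our notation, with $n$ the sample size per state-action pair):

\[
  \sqrt{n}\,(\hat\eta^\pi - \eta^\pi) \rightsquigarrow \mathbb{G}
  \quad \text{in } \ell^\infty(\mathcal{F}), \qquad
  \mathcal{F} \in \{\mathcal{F}_{W_1},\, \mathcal{F}_{\mathrm{KS}},\, \mathcal{F}_{\mathrm{TV}}\},
\]

where $\mathbb{G}$ is a tight Gaussian process and $\ell^\infty(\mathcal{F})$ denotes the space of bounded functionals on $\mathcal{F}$; results of this form underpin statistical inference for functionals of $\eta^\pi$.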