Breaking the Order Barrier: Off-Policy Evaluation for Confounded POMDPs

Qi Kuang, Jiayi Wang, Fan Zhou, Zhengling Qi

Jiangxi University of Finance and Economics · University of Texas at Dallas · Shanghai University of Finance and Economics · George Washington University

off-policy evaluation · confounded POMDPs · identification

NeurIPS ⋅ Poster ⋅ OpenReview

Abstract

We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs) with unobserved confounding. Recent advances have introduced bridge functions to circumvent unmeasured confounding and to develop estimators for the policy value, yet how the statistical error bounds of these estimators depend on the horizon length $T$ and the size of the state-action space $|\mathcal{O}||\mathcal{A}|$ remains largely unexplored.

In this paper, we systematically investigate the finite-sample error bounds of OPE estimators in finite-horizon tabular confounded POMDPs. Specifically, we show that under certain rank conditions, the estimation error for the policy value can achieve a rate of $O(T^{1.5}/\sqrt{n})$, excluding the cardinality of the observation space $|\mathcal{O}|$ and the action space $|\mathcal{A}|$. With an additional mild condition on the concentrability coefficients in confounded POMDPs, the rate of estimation error can be improved to $O(T/\sqrt{n})$.

We also show that for a fully history-dependent policy, the estimation error scales as $O\big(\frac{T}{\sqrt{n}}\,(|\mathcal{O}||\mathcal{A}|)^{\frac{T}{2}}\big)$, highlighting the exponential error dependence introduced by using history-based proxies to infer hidden states. Furthermore, when the target policy is a memoryless policy, the error bound improves to $O\big(\frac{T}{\sqrt{n}}\sqrt{|\mathcal{O}||\mathcal{A}|}\big)$, which matches the optimal rate known for tabular MDPs.
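As a rough numerical illustration of how the four rates stated in this abstract compare, the sketch below evaluates each expression with all constants and any logarithmic factors dropped. The function names and the example values of the horizon, sample size, and space cardinalities are hypothetical choices made for illustration and do not come from the paper.

```python
import math


def rate_rank_conditions(T: int, n: int) -> float:
    """T^{1.5} / sqrt(n): the rate stated under rank conditions alone."""
    return T ** 1.5 / math.sqrt(n)


def rate_with_concentrability(T: int, n: int) -> float:
    """T / sqrt(n): the rate stated with the extra concentrability condition."""
    return T / math.sqrt(n)


def rate_history_dependent(T: int, n: int, num_obs: int, num_actions: int) -> float:
    """T / sqrt(n) * (|O||A|)^{T/2}: fully history-dependent target policy."""
    return T / math.sqrt(n) * (num_obs * num_actions) ** (T / 2)


def rate_memoryless(T: int, n: int, num_obs: int, num_actions: int) -> float:
    """T / sqrt(n) * sqrt(|O||A|): memoryless target policy."""
    return T / math.sqrt(n) * math.sqrt(num_obs * num_actions)


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only to make the gap visible.
    T, n, num_obs, num_actions = 10, 10_000, 5, 2
    print("rank conditions      :", rate_rank_conditions(T, n))
    print("with concentrability :", rate_with_concentrability(T, n))
    print("history-dependent    :", rate_history_dependent(T, n, num_obs, num_actions))
    print("memoryless           :", rate_memoryless(T, n, num_obs, num_actions))
```

Because these are order-of-magnitude expressions rather than bounds with explicit constants, only the relative growth is meaningful: the history-dependent expression grows exponentially in $T$, while the memoryless one grows with the square root of $|\mathcal{O}||\mathcal{A}|$.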

To the best of our knowledge, this is the first work to provide a comprehensive finite-sample analysis of OPE in confounded POMDPs.