Optimal Regret Bounds via Low-Rank Structured Variation in Non-Stationary Reinforcement Learning

Hanoi University of Science and Technology

Abstract

We study reinforcement learning in non-stationary communicating MDPs whose transition drift admits a low-rank plus sparse structure. We propose SVUCRL (Structured Variation UCRL) and prove the dynamic-regret bound

\[
\widetilde{O}\left( D_{\max} S \sqrt{A T} + D_{\max} \sqrt{(B_r + B_p) K S T} + D_{\max} \delta_B B_p \right),
\]

where $S$ is the number of states, $A$ the number of actions, $T$ the horizon, $D_{\max}$ the MDP diameter, $B_r$ and $B_p$ the total reward and transition variation budgets, and $K \ll SA$ the rank of the structured drift.

The first term is the statistical price of learning in stationary problems; the second is the non-stationarity price, which scales with $\sqrt{K}$ rather than $\sqrt{SA}$ when drift is low-rank. This matches the $\sqrt{T}$ rate (up to logarithmic factors) and improves on prior $T^{3/4}$-type guarantees.
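To make the $\sqrt{K}$ versus $\sqrt{SA}$ comparison concrete, the following minimal sketch computes the ratio of the two scalings for a given problem size. The function name `drift_price_ratio` and the example values of $S$, $A$, and $K$ are assumptions chosen purely for illustration; constants and logarithmic factors are ignored.

```python
import math


def drift_price_ratio(S: int, A: int, K: int) -> float:
    """Return sqrt(K) / sqrt(S * A).

    This is the factor by which a term scaling with sqrt(K) is smaller
    than the same term scaling with sqrt(S * A). Hypothetical helper,
    not part of any published implementation.
    """
    if S < 1 or A < 1 or not 0 < K <= S * A:
        raise ValueError("need S, A >= 1 and 0 < K <= S * A")
    return math.sqrt(K / (S * A))


# Illustrative sizes (assumed): 20 states, 5 actions, rank-4 drift.
ratio = drift_price_ratio(S=20, A=5, K=4)
print(f"non-stationarity term shrinks by a factor of {ratio:.3f}")
```

With these assumed sizes the ratio is $\sqrt{4/100} = 0.2$, so the non-stationarity term is five times smaller than it would be under a $\sqrt{SA}$ scaling. When $K = SA$ the ratio is 1 and the low-rank structure gives no gain.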

SVUCRL combines:

online low-rank tracking with explicit Frobenius guarantees,
incremental RPCA to separate structured drift from sparse shocks,
adaptive confidence widening via a bias-corrected local-variation estimator, and
factor forecasting with an optimal shrinkage center.
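
As a rough illustration of the third component, the sketch below widens a Hoeffding-style confidence radius by an estimate of local variation. This is not the paper's estimator: the function names, the window of per-step estimates, the `noise_bias` correction term, and the choice of a Hoeffding base radius for $[0, 1]$-valued samples are all assumptions made for illustration.

```python
import math
from typing import Sequence


def local_variation(estimates: Sequence[float], noise_bias: float) -> float:
    """Bias-corrected variation over a window (hypothetical estimator).

    Sums absolute consecutive differences of the windowed estimates, then
    subtracts `noise_bias`, an assumed bound on the part of that sum that
    sampling noise alone would produce. The result is clipped at zero.
    """
    raw = sum(abs(b - a) for a, b in zip(estimates, estimates[1:]))
    return max(0.0, raw - noise_bias)


def widened_radius(n: int, delta: float, variation: float) -> float:
    """Hoeffding radius for n samples in [0, 1], plus a variation term."""
    if n < 1 or not 0.0 < delta < 1.0 or variation < 0.0:
        raise ValueError("need n >= 1, 0 < delta < 1, variation >= 0")
    base = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return base + variation


# Illustrative window of per-step estimates (assumed values).
window = [0.30, 0.32, 0.31, 0.45]
v_hat = local_variation(window, noise_bias=0.05)
print(widened_radius(n=200, delta=0.05, variation=v_hat))
```

The design intent is that the radius reduces to the usual stationary one when the estimated variation is zero and grows only when the window shows drift beyond what noise would explain.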