On the Sample Complexity Bounds of Bilevel Reinforcement Learning

Mudit Gaur, Utsav Singh, Amrit Singh Bedi, Raghu Pasupathy, Vaneet Aggarwal

Purdue University· IIT Kanpur· University of Central Florida

Bilevel Reinforcement Learning Sample Complexity Gradient Descent Bilevel Optimization

Abstract

Bilevel reinforcement learning (BRL) has emerged as a powerful framework for aligning generative models, yet its theoretical foundations, especially sample complexity bounds, remain underexplored. In this work, we present the first sample complexity bound for BRL, establishing a rate of

O (ϵ^{- 3})

in continuous state-action spaces.

Traditional MDP analysis techniques do not extend to BRL due to its nested structure and non-convex lower-level problems. We overcome these challenges by leveraging the Polyak-Łojasiewicz (PL) condition and the MDP structure to obtain closed-form gradients, enabling tight sample complexity analysis.

Our analysis also extends to general bi-level optimization settings with non-convex lower levels, where we achieve state-of-the-art sample complexity results of

O (ϵ^{- 3})

improving upon existing bounds of

O (ϵ^{- 6})

Additionally, we address the computational bottleneck of hypergradient estimation by proposing a fully first-order, Hessian-free algorithm suitable for large-scale problems.