Poster Session 3 · Thursday, December 4, 2025 11:00 AM → 2:00 PM
#5303
Sharp Analysis for KL-Regularized Contextual Bandits and RLHF
Abstract
Reverse-Kullback-Leibler (KL) regularization has emerged as a predominant technique for enhancing policy optimization in reinforcement learning (RL) and reinforcement learning from human feedback (RLHF), forcing the learned policy to stay close to a reference policy. While the effectiveness of KL-regularization has been demonstrated empirically in various practical scenarios, existing theoretical analyses of KL-regularized RLHF still yield the same sample complexity as analyses without KL-regularization.
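For context, the reverse-KL-regularized objective studied in this line of work is commonly written in the following standard form; the reward $r$, prompt distribution $d_0$, and regularization strength $\beta$ below are generic notation rather than symbols taken from this abstract, and the closed-form maximizer is a reweighting of the reference policy:
\[
\max_{\pi}\ \mathbb{E}_{x \sim d_0,\, a \sim \pi(\cdot \mid x)}\bigl[r(x,a)\bigr]
\;-\; \beta\, \mathbb{E}_{x \sim d_0}\Bigl[\mathrm{KL}\bigl(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\bigr)\Bigr],
\qquad
\pi^{*}(a \mid x) \;\propto\; \pi_{\mathrm{ref}}(a \mid x)\, \exp\bigl(r(x,a)/\beta\bigr).
\]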
To understand the fundamental distinction between objectives with and without KL-regularization, we are the first to theoretically demonstrate the power of KL-regularization, providing a sharp analysis of KL-regularized contextual bandits and RLHF that reveals an $\mathcal{O}(\epsilon^{-1})$ sample complexity when the target accuracy $\epsilon$ is sufficiently small. We also prove matching lower bounds for both settings.
More specifically, we study how the coverage of the reference policy affects the sample complexity of KL-regularized online contextual bandits and RLHF. We show that, with sufficient coverage from the reference policy, a simple two-stage mixed sampling algorithm achieves an $\mathcal{O}(\epsilon^{-1})$ sample complexity with only an additive dependence on the coverage coefficient, thereby proving the benefit of online data even without explicit exploration. Our results provide a comprehensive understanding of the roles of KL-regularization and data coverage in online decision making, shedding light on the design of more efficient algorithms.
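As an illustrative sketch only (the paper's exact definition may differ), the coverage of the reference policy is often quantified by a density-ratio coefficient between a comparator policy $\pi^{*}$ and $\pi_{\mathrm{ref}}$:
\[
C_{\mathrm{cov}} \;=\; \mathbb{E}_{x \sim d_0}\Bigl[\,\sup_{a}\ \frac{\pi^{*}(a \mid x)}{\pi_{\mathrm{ref}}(a \mid x)}\,\Bigr].
\]
Under this reading, the two-stage bound above has the schematic form $\mathcal{O}(\epsilon^{-1}) + (\text{term depending on } C_{\mathrm{cov}})$: the coverage coefficient enters additively rather than multiplying the leading $\epsilon^{-1}$ term.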