Regret Bounds for Adversarial Contextual Bandits with General Function Approximation and Delayed Feedback

Orin Levy, Liad Erez, Alon Cohen, Yishay Mansour

Keywords: Regret, Bandits, Contextual Bandits, Delayed Feedback, Function Approximation

Abstract

We present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem over $K$ actions in the presence of delayed feedback, a scenario where loss observations arrive with delays chosen by an adversary. As a preliminary result, assuming direct access to a finite policy class $\Pi$, we establish an optimal expected regret bound of $O\left(\sqrt{K T \log |\Pi|} + \sqrt{D \log |\Pi|}\right)$, where $D$ is the sum of delays.

For our main contribution, we study the general function approximation setting over a (possibly infinite) contextual loss function class $\mathcal{F}$ with access to an online least-square regression oracle $\mathcal{O}$ over $\mathcal{F}$. In this setting, we achieve an expected regret bound of $O\left(\sqrt{K T R_T(\mathcal{O})} + \sqrt{d_{\max} D \beta}\right)$ assuming FIFO order, where $d_{\max}$ is the maximal delay, $R_T(\mathcal{O})$ is an upper bound on the oracle's regret and $\beta$ is a stability parameter associated with the oracle.
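To make the delay quantities concrete, the following is a minimal sketch, not taken from the paper, that computes the sum of delays $D$ and the maximal delay $d_{\max}$ from a delay sequence and evaluates the shape of the bound above with constants dropped. The function names `delay_summary` and `bound_shape` are hypothetical, and the FIFO check assumes one common formalization: feedback for round $t$ arrives at time $t + d_t$, and these arrival times are non-decreasing in $t$. The paper's own definition may differ.

```python
import math


def delay_summary(delays):
    """Summarize a delay sequence; delays[t] is the delay d_t of round t (0-indexed).

    Assumption (not from the source): feedback for round t arrives at time t + d_t,
    and "FIFO order" means these arrival times are non-decreasing in t.
    """
    arrivals = [t + d for t, d in enumerate(delays)]
    fifo = all(a <= b for a, b in zip(arrivals, arrivals[1:]))
    return {"D": sum(delays), "d_max": max(delays), "fifo": fifo}


def bound_shape(K, T, oracle_regret, beta, delays):
    """Evaluate sqrt(K*T*R_T(O)) + sqrt(d_max*D*beta) with all constants dropped."""
    s = delay_summary(delays)
    no_delay_term = math.sqrt(K * T * oracle_regret)
    delay_term = math.sqrt(s["d_max"] * s["D"] * beta)
    return no_delay_term + delay_term
```

With all delays equal to zero the second term vanishes and only the $\sqrt{K T R_T(\mathcal{O})}$ term remains, which is one way to see how the delay term adds to the non-delayed rate.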

We complement this general result by presenting a novel stability analysis of a Hedge-based version of Vovk's aggregating forecaster as an oracle implementation for least-square regression over a finite function class $\mathcal{F}$ and show that its stability parameter $\beta$ is bounded by $\log |\mathcal{F}|$, resulting in an expected regret bound of $O\left(\sqrt{K T \log |\mathcal{F}|} + \sqrt{d_{\max} D \log |\mathcal{F}|}\right)$, which is a $\sqrt{d_{\max}}$ factor away from the lower bound of $\Omega\left(\sqrt{K T \log |\mathcal{F}|} + \sqrt{D \log |\mathcal{F}|}\right)$ that we also present.
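As an illustration of the kind of oracle discussed here, the sketch below runs plain exponential weights (Hedge) over a finite function class with square loss and predicts the weighted mean. It is a generic textbook construction under stated assumptions, not necessarily the exact forecaster analyzed in the paper: predictions and targets are assumed to lie in $[0, 1]$, the learning rate is fixed to $\eta = 1/2$ (for which the square loss on $[0, 1]$ is exp-concave), and the name `exp_weights_regression` is hypothetical. Under these assumptions the regret against the best function in the class is at most $2 \ln |\mathcal{F}|$.

```python
import math
import random


def exp_weights_regression(functions, data, eta=0.5):
    """Exponential weights over a finite function class with square loss.

    functions: list of callables mapping a context to a prediction in [0, 1]
    data: list of (context, target) pairs with targets in [0, 1]
    Returns (cumulative loss of the forecaster, list of per-function cumulative losses).
    """
    n = len(functions)
    log_w = [0.0] * n   # log-weights, starting from a uniform prior
    cum = [0.0] * n     # cumulative square loss of each function
    total = 0.0         # cumulative square loss of the forecaster
    for x, y in data:
        m = max(log_w)  # subtract the max for numerical stability
        w = [math.exp(lw - m) for lw in log_w]
        z = sum(w)
        preds = [f(x) for f in functions]
        y_hat = sum(wi * p for wi, p in zip(w, preds)) / z  # weighted-mean prediction
        total += (y_hat - y) ** 2
        for i, p in enumerate(preds):
            loss = (p - y) ** 2
            cum[i] += loss
            log_w[i] -= eta * loss  # multiplicative (Hedge) update in log space
    return total, cum


# Small synthetic run: five linear predictors, noisy targets near 0.5 * x.
random.seed(0)
functions = [lambda x, c=c: min(1.0, max(0.0, c * x)) for c in (0.0, 0.25, 0.5, 0.75, 1.0)]
data = []
for _ in range(200):
    x = random.random()
    y = min(1.0, max(0.0, 0.5 * x + random.uniform(-0.1, 0.1)))
    data.append((x, y))

total, cum = exp_weights_regression(functions, data)
regret = total - min(cum)
print(f"regret = {regret:.4f}, bound 2 ln|F| = {2 * math.log(len(functions)):.4f}")
```

The logarithmic dependence on $|\mathcal{F}|$ in this regret guarantee mirrors the $\log |\mathcal{F}|$ factors in the bounds above. The stability parameter $\beta$ is a separate property of the oracle and is not measured by this sketch.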