Follow-the-Perturbed-Leader Nearly Achieves Best-of-Both-Worlds for the m-Set Semi-Bandit Problems

Jingxin Zhan, Yuchen Xin, Chenjie Sun, Zhihua Zhang

multi-armed bandit combinatorial semi-bandit m-set semi-bandit best-of-both-worlds follow-the-perturbed-leader Fréchet distribution

⋅ NeurIPS ⋅ OpenReview

Abstract

We consider a common case of the combinatorial semi-bandit problem, the

m

-set semi-bandit, where the learner exactly selects

m

arms from the total

d

arms. In the adversarial setting, the best regret bound, known to be

O (nm d http://www.w3.org/2000/svg" width="400em" height="1.08em" viewBox="0 0 400000 1080" preserveAspectRatio="xMinYMin slice">)

for time horizon

n

, is achieved by the well-known Follow-the-Regularized-Leader (FTRL) policy. However, this requires to explicitly compute the arm-selection probabilities via optimizing problems at each time step and sample according to them.

This problem can be avoided by the Follow-the-Perturbed-Leader (FTPL) policy, which simply pulls the

m

arms that rank among the

m

smallest (estimated) loss with random perturbation.

In this paper, we show that FTPL with a Fréchet perturbation also enjoys the near optimal regret bound

O (nm http://www.w3.org/2000/svg" width="400em" height="1.08em" viewBox="0 0 400000 1080" preserveAspectRatio="xMinYMin slice"> (d lo g (d) http://www.w3.org/2000/svg" width="400em" height="1.28em" viewBox="0 0 400000 1296" preserveAspectRatio="xMinYMin slice"> + m^{5/6}))

in the adversarial setting and approaches best-of-both-world regret bounds, i.e., achieves a logarithmic regret for the stochastic setting. Moreover, our lower bounds show that the extra factors are unavoidable with our approach; any improvement would require a fundamentally different and more challenging method.