Poster Session 1 · Wednesday, December 3, 2025 11:00 AM → 2:00 PM
#3313

Taming Adversarial Constraints in CMDPs

NeurIPS Poster OpenReview

Abstract

In constrained MDPs (CMDPs) with adversarial rewards and constraints, a known impossibility result prevents any algorithm from attaining both sublinear regret and sublinear constraint violation when competing against a best-in-hindsight policy that satisfies the constraints on average. In this paper, we show that this negative result can be eased in CMDPs with non-stationary rewards and constraints, i.e., in settings that generalize both stochastic and adversarial CMDPs, by providing algorithms whose performance smoothly degrades as the level of environment adverseness increases.
Specifically, our algorithms attain regret and positive constraint violation bounds, under bandit feedback, that scale with a parameter C measuring the adverseness of rewards and constraints. In the worst case, these bounds are coherent with the impossibility result for adversarial CMDPs.
First, we design an algorithm with the desired guarantees when C is known. Then, for the case in which C is unknown, we obtain the same results by embedding multiple instances of that algorithm in a general meta-procedure, which suitably selects among them so as to balance the trade-off between regret and constraint violation.
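To make the meta-procedure idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual algorithm): one base instance is created per guessed adverseness level, with guesses doubling as C_j = 2^j, and an EXP3-style master learns from bandit feedback which instance to trust. The environment rewards here are synthetic stand-ins for "base learner's reward minus a penalty for its constraint violation".

```python
import math
import random


class Exp3Master:
    """EXP3-style meta-procedure that selects among base learners.

    Hypothetical sketch: one base instance per guess of the unknown
    adverseness level C; the master only observes the payoff of the
    instance it selects (bandit feedback).
    """

    def __init__(self, num_learners, lr=0.01, gamma=0.1):
        self.weights = [1.0] * num_learners
        self.lr = lr        # learning rate for the exponential update
        self.gamma = gamma  # uniform-exploration mixing coefficient

    def probs(self):
        # mix normalized weights with uniform exploration
        total = sum(self.weights)
        k = len(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / k
                for w in self.weights]

    def select(self, rng):
        p = self.probs()
        return rng.choices(range(len(p)), weights=p, k=1)[0]

    def update(self, chosen, reward):
        # importance-weighted estimate: only the selected learner's
        # weight changes, scaled by its selection probability
        p = self.probs()[chosen]
        self.weights[chosen] *= math.exp(self.lr * reward / p)


rng = random.Random(0)
# doubling guesses for the unknown adverseness level C: 1, 2, 4, 8
guesses = [2 ** j for j in range(4)]
master = Exp3Master(num_learners=len(guesses))

for t in range(1000):
    j = master.select(rng)
    # synthetic payoff: pretend the instance with guess 4 (index 2)
    # best balances regret against constraint violation
    reward = 1.0 if j == 2 else 0.3
    master.update(j, reward)

best = max(range(len(guesses)), key=lambda j: master.weights[j])
print("selected guess for C:", guesses[best])
```

The exploration term gamma keeps every instance's selection probability bounded away from zero, so the importance-weighted estimates stay bounded; over the rounds, the weight of the best-performing instance comes to dominate.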
Poster