Online Inverse Linear Optimization: Efficient Logarithmic-Regret Algorithm, Robustness to Suboptimality, and Lower Bound

Shinsaku Sakaue, Taira Tsuchiya, Han Bao, Taihei Oki

CyberAgent · University of Tokyo · RIKEN AIP · Institute of Statistical Mathematics · Hokkaido University

Abstract

In online inverse linear optimization, a learner observes time-varying sets of feasible actions and an agent's optimal actions, selected by solving linear optimization over the feasible actions. The learner sequentially makes predictions of the agent's true linear objective function, and their quality is measured by the regret, the cumulative gap between optimal objective values and those achieved by following the learner's predictions. A seminal work by Bärmann et al. (2017) obtained a regret bound of $O(\sqrt{T})$, where $T$ is the time horizon. Subsequently, the regret bound has been improved to $O(n^{4} \ln T)$ by Besbes et al. (2021, 2025) and to $O(n \ln T)$ by Gollapudi et al. (2021), where $n$ is the dimension of the ambient space of objective vectors.
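One way to write down the regret described above is sketched here. The symbols $X_t$ (feasible set in round $t$), $c^*$ (true objective vector), $\hat{c}_t$ (the learner's prediction), $x_t$ (the agent's action), and $\hat{x}_t$ (the action induced by the prediction), as well as the maximization convention, are notation assumed for illustration and are not fixed by the abstract.

```latex
x_t \in \operatorname*{arg\,max}_{x \in X_t} \langle c^*, x \rangle, \qquad
\hat{x}_t \in \operatorname*{arg\,max}_{x \in X_t} \langle \hat{c}_t, x \rangle, \qquad
R_T = \sum_{t=1}^{T} \langle c^*, x_t - \hat{x}_t \rangle .
```

Under this convention every summand is nonnegative, because $x_t$ maximizes $\langle c^*, \cdot \rangle$ over $X_t$.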

However, these logarithmic-regret methods are highly inefficient when $T$ is large, as they need to maintain regions specified by $O(T)$ constraints, which represent possible locations of the true objective vector. In this paper, we present the first logarithmic-regret method whose per-round complexity is independent of $T$; indeed, it achieves the best-known bound of $O(n \ln T)$. Our method is strikingly simple: it applies the online Newton step (ONS) to appropriate exp-concave loss functions.
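To illustrate why an ONS-style learner has per-round cost independent of $T$, the following is a minimal sketch of the generic ONS update on a toy instance. It is not the paper's algorithm: the abstract does not specify the exp-concave loss, so the code uses the vector `x_hat - x_obs` (a subgradient of the suboptimality loss $c \mapsto \max_{x \in X_t} \langle c, x \rangle - \langle c, x_t \rangle$) as a stand-in gradient. The finite feasible sets, the unit-ball domain, the parameters `gamma` and `eps`, and all function names are assumptions made for illustration.

```python
import numpy as np


def project_ball_in_A_norm(y, A):
    """Return argmin over ||x||_2 <= 1 of (x - y)^T A (x - y), for positive definite A."""
    norm_y = np.linalg.norm(y)
    if norm_y <= 1.0:
        return y
    d, Q = np.linalg.eigh(A)  # A = Q diag(d) Q^T
    z = Q.T @ y

    def x_of(lam):
        # KKT point for multiplier lam: x = (A + lam I)^{-1} A y
        return Q @ (d / (d + lam) * z)

    # ||x_of(lam)|| decreases in lam, and ||x_of(hi)|| <= 1 holds for this hi.
    lo, hi = 0.0, d.max() * (norm_y - 1.0)
    for _ in range(100):  # bisection on the multiplier
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(x_of(mid)) > 1.0:
            lo = mid
        else:
            hi = mid
    return x_of(hi)


def best_action(c, X):
    """Row of X (a finite feasible set, one action per row) maximizing <c, x>."""
    return X[np.argmax(X @ c)]


def run_ons(c_true, feasible_sets, gamma=0.5, eps=1.0):
    """Run the generic ONS update; return the final prediction and the regret."""
    n = c_true.size
    c_hat = np.zeros(n)      # current prediction of the objective vector
    A = eps * np.eye(n)      # running matrix eps*I + sum of g g^T
    regret = 0.0
    for X in feasible_sets:
        x_hat = best_action(c_hat, X)    # action induced by the prediction
        x_obs = best_action(c_true, X)   # agent's (optimal) action
        regret += c_true @ (x_obs - x_hat)
        g = x_hat - x_obs                # stand-in gradient (assumption)
        A += np.outer(g, g)
        step = np.linalg.solve(A, g) / gamma
        c_hat = project_ball_in_A_norm(c_hat - step, A)
    return c_hat, regret
```

The state carried between rounds is only `c_hat` and the $n \times n$ matrix `A`, so the work per round depends on $n$ and the size of the current feasible set but not on how many rounds have elapsed. A call such as `run_ons(np.array([0.6, -0.8]), [np.random.default_rng(0).normal(size=(5, 2)) for _ in range(100)])` runs the sketch on random finite feasible sets.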

Moreover, for the case where the agent's actions are possibly suboptimal, we establish a regret bound of $O(n \ln T + \sqrt{\Delta_{T} n \ln T})$, where $\Delta_{T}$ is the cumulative suboptimality of the agent's actions. This bound is achieved by using MetaGrad, which runs ONS with $\Theta(\ln T)$ different learning rates in parallel. We also present a lower bound of $\Omega(n)$, showing that the $O(n \ln T)$ bound is tight up to an $O(\ln T)$ factor.
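As a rough picture of what running ONS with $\Theta(\ln T)$ learning rates in parallel involves, the sketch below builds an exponentially spaced grid of rates in the style commonly associated with MetaGrad. The function name, the bounds `D` (domain diameter) and `G` (gradient norm), and the constant 5 are assumptions for illustration; the abstract does not state the grid actually used.

```python
import math


def learning_rate_grid(T, D=1.0, G=1.0):
    """Rates eta_i = 2^{-i} / (5 D G) for i = 0, ..., ceil(0.5 * log2 T); requires T >= 1."""
    k = math.ceil(0.5 * math.log2(T))
    return [2.0 ** (-i) / (5.0 * D * G) for i in range(k + 1)]
```

The grid length grows logarithmically in `T`, so maintaining one ONS instance per rate multiplies the per-round cost by a $\Theta(\ln T)$ factor.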