We study online inverse linear optimization, also known as contextual recommendation, in which a learner sequentially infers an agent's hidden objective vector from optimal actions observed over time-varying feasible sets. The learner aims to recommend actions that perform well under the agent's true objective; performance is measured by the regret, defined as the cumulative gap between the agent's optimal values and the values achieved by the learner's recommended actions. Prior work has established a regret bound of $O(d\log T)$, as well as a finite but exponentially large bound of $\exp(O(d\log d))$, where $d$ is the dimension of the optimization problem and $T$ is the time horizon; a regret lower bound of $\Omega(d)$ is also known (Gollapudi et al. 2021; Sakaue et al. 2025). Whether a finite regret bound polynomial in $d$ is achievable has remained an open question. We partially resolve it by showing that when the feasible sets are M-convex, a broad class that includes matroids, a finite regret bound of $O(d\log d)$ is possible. We achieve this by combining a structural characterization of optimal solutions on M-convex sets with a geometric volume argument. Moreover, we extend our approach to feedback that is adversarially corrupted in up to $C$ rounds. We obtain a regret bound of $O((C+1)d\log d)$ without prior knowledge of $C$, by monitoring directed graphs induced by the observed feedback to detect corruptions adaptively.