We study adversarial multi-armed bandits under time-varying constraints, a setting motivated by numerous real-world applications. For this setting, we propose a primal-dual algorithm that extends online mirror descent with suitable gradient estimators and an explicit mechanism for handling the constraints. We prove that the proposed policy achieves sublinear dynamic regret and sublinear constraint violation, and its guarantees on both metrics are state of the art. Empirical evaluations show that our approach outperforms the methods it is compared against.
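The abstract does not spell out the algorithm. As a rough illustration of the general primal-dual idea only, the Python sketch below combines two steps. The first is an EXP3-style exponential-weights update, which is online mirror descent with a negative-entropy regularizer, applied to importance-weighted estimates of the Lagrangian. The second is a projected gradient step on a nonnegative dual variable. The function name `primal_dual_exp3`, the step sizes `eta` and `mu`, the exploration rate `gamma`, and the constraint convention (per-round costs in [-1, 1], feasible when the expected cost is at most 0) are all hypothetical choices, not the authors' method. The sketch also makes no attempt at the tracking needed for dynamic-regret guarantees.

```python
import math
import random


def primal_dual_exp3(rewards, costs, eta=0.05, mu=0.05, gamma=0.1, seed=0):
    """Hypothetical primal-dual EXP3-style sketch (not the paper's algorithm).

    rewards[t][i] in [0, 1]: adversarial reward of arm i at round t.
    costs[t][i] in [-1, 1]: time-varying constraint value; round t asks E[cost] <= 0.
    Only the pulled arm's reward and cost are observed (bandit feedback).
    Returns the pulled arms, the last sampling distribution, and the dual trajectory.
    """
    rng = random.Random(seed)
    k = len(rewards[0])
    log_w = [0.0] * k  # log-weights of the exponential-weights (entropic OMD) iterate
    lam = 0.0          # Lagrange multiplier for the constraint, kept >= 0
    p = [1.0 / k] * k
    arms, lams = [], []
    for r_t, c_t in zip(rewards, costs):
        # Primal iterate: softmax of the log-weights (max-shifted for stability),
        # mixed with uniform exploration so every p[i] >= gamma / k.
        m = max(log_w)
        z = [math.exp(v - m) for v in log_w]
        s = sum(z)
        p = [(1 - gamma) * zi / s + gamma / k for zi in z]
        a = rng.choices(range(k), weights=p)[0]
        # Importance-weighted estimate of the Lagrangian payoff r - lam * c,
        # nonzero only for the pulled arm a.
        est = (r_t[a] - lam * c_t[a]) / p[a]
        log_w[a] += eta * est
        # Dual step: projected gradient ascent using the observed cost c_t[a],
        # an unbiased estimate of the expected constraint value <p, c_t>.
        lam = max(0.0, lam + mu * c_t[a])
        arms.append(a)
        lams.append(lam)
    return arms, p, lams
```

For example, `primal_dual_exp3([[1.0, 0.5]] * 200, [[0.5, -0.5]] * 200)` runs 200 rounds in which arm 0 earns more but violates the constraint. The dual variable grows each time arm 0 is pulled and shrinks (down to 0) each time arm 1 is pulled, which penalizes the infeasible arm in the primal update.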