Combinatorial bandits extend the classical bandit framework to settings where the learner selects multiple arms in each round, motivated by applications such as online recommendation and assortment optimization. While extensions of upper confidence bound (UCB) algorithms arise naturally in this context, adapting arm elimination methods has proved more challenging. We introduce a novel elimination scheme that partitions arms into three categories (confirmed, active, and eliminated), and incorporates explicit exploration to update these sets. We demonstrate the efficacy of our algorithm in two settings: the combinatorial multi-armed bandit with general graph feedback, and the combinatorial linear contextual bandit. In both cases, our approach achieves near-optimal regret, whereas UCB-based methods can provably fail due to insufficient explicit exploration. Matching lower bounds are also provided.
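The abstract describes the three-set bookkeeping only at a high level. Below is a minimal, hypothetical sketch (in Python) of how a confirmed/active/eliminated elimination loop could look for top-k arm selection, assuming Bernoulli semi-bandit feedback and Hoeffding-style confidence intervals; the function name `three_set_elimination`, the thresholds, and the reward model are illustrative assumptions, not the paper's actual algorithm or guarantees.

```python
import numpy as np

def three_set_elimination(true_means, k, horizon, delta=0.05, seed=0):
    """Top-k selection with confirmed / active / eliminated arm sets (sketch only)."""
    rng = np.random.default_rng(seed)
    n = len(true_means)
    confirmed, active, eliminated = set(), set(range(n)), set()
    counts, sums = np.zeros(n), np.zeros(n)

    for _ in range(horizon):
        # Explicit exploration: prioritise the least-sampled active arms,
        # then fill the remaining slots with already-confirmed arms.
        explore = sorted(active, key=lambda a: counts[a])
        play = (explore + sorted(confirmed))[:k]

        for a in play:  # Bernoulli feedback on each played arm (assumed model)
            counts[a] += 1
            sums[a] += float(rng.random() < true_means[a])

        # Hoeffding-style confidence intervals (illustrative choice).
        safe = np.maximum(counts, 1)
        mean = sums / safe
        rad = np.sqrt(np.log(2 * n * horizon / delta) / (2 * safe))
        lcb, ucb = mean - rad, mean + rad

        for a in list(active):
            others = [b for b in range(n) if b != a and b not in eliminated]
            if len(others) < k:
                # Fewer than k candidates remain besides a, so a must be in the top k.
                active.discard(a); confirmed.add(a)
                continue
            kth_ucb = np.sort(ucb[others])[::-1][k - 1]
            kth_lcb = np.sort(lcb[others])[::-1][k - 1]
            if lcb[a] > kth_ucb:        # a surely belongs to the top k: confirm
                active.discard(a); confirmed.add(a)
            elif ucb[a] < kth_lcb:      # a surely lies outside the top k: eliminate
                active.discard(a); eliminated.add(a)

    return confirmed, active, eliminated


# Example: 6 arms, select k = 3 per round.
if __name__ == "__main__":
    print(three_set_elimination([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], k=3, horizon=2000))
```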