We introduce the problem of regret minimization in adversarial multi-dueling bandits. While adversarial preferences have been studied in dueling bandits, they have not been explored in the multi-dueling setting. Here, the learner must select $m \geq 2$ arms at each round and observes as feedback the identity of the most preferred arm, where preferences are governed by an arbitrary preference matrix chosen by an oblivious adversary. We introduce a novel algorithm, MiDEX (Multi Dueling EXP3), to learn from such preference feedback, which is assumed to be generated by a pairwise-subset choice model. We prove that the expected cumulative $T$-round regret of MiDEX with respect to a Borda winner of a set of $K$ arms is upper bounded by $O((K \log K)^{1/3} T^{2/3})$. Moreover, we prove a lower bound of $\Omega(K^{1/3} T^{2/3})$ on the expected regret in this setting, which shows that our proposed algorithm is near-optimal.
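To make the feedback model concrete, the following is a minimal illustrative sketch (not the paper's MiDEX algorithm) of one round of an EXP3-style multi-dueling learner. It assumes a pairwise-subset choice model in which the announced winner is obtained by drawing a uniform pair from the selected subset and reporting that duel's winner according to the preference matrix $P$; the function names, the exploration mixture, and the step sizes `gamma` and `eta` are our own illustrative choices, not specifics from the abstract.

```python
import math
import random

def pairwise_subset_winner(P, subset, rng):
    """Assumed pairwise-subset choice model: draw a uniform pair from the
    selected subset; arm i beats arm j with probability P[i][j]."""
    i, j = rng.sample(subset, 2)  # two positions in the subset, random order
    return i if rng.random() < P[i][j] else j

def exp3_multidueling_round(weights, P, m, gamma, eta, rng):
    """One round of an EXP3-style learner for multi-dueling feedback.

    Selects m arms from an exponential-weights distribution mixed with
    uniform exploration, observes only the identity of the winning arm,
    and performs an importance-weighted multiplicative update.
    """
    K = len(weights)
    total = sum(weights)
    # Mix exponential weights with gamma-uniform exploration.
    probs = [(1 - gamma) * w / total + gamma / K for w in weights]
    # Draw the m-subset i.i.d. from probs (with replacement, for simplicity).
    subset = rng.choices(range(K), weights=probs, k=m)
    winner = pairwise_subset_winner(P, subset, rng)
    # Unbiased importance-weighted estimate of the winner's (Borda-style) reward.
    estimate = 1.0 / probs[winner]
    weights[winner] *= math.exp(eta * estimate)
    return winner
```

A usage note: with a preference matrix whose first arm beats every other arm in expectation (the Borda winner), repeated rounds drive `weights[0]` upward, so the learner's subsets increasingly include that arm.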