In dueling bandits, the learner receives preference feedback between arms, and the regret of an arm is defined in terms of its suboptimality to a winner arm. The more challenging and practically motivated non-stationary variant of dueling bandits, where preferences change over time, has been the focus of several recent works (Saha and Gupta, 2022; Buening and Saha, 2023; Suk and Agarwal, 2023). The goal is to design algorithms without foreknowledge of the amount of change. The bulk of known results here studies the Condorcet winner setting, where an arm preferred over any other exists at all times. Yet such a winner may not exist and, by contrast, the Borda version of this problem (which is always well-defined) has received little attention. In this work, we establish the first optimal and adaptive Borda dynamic regret upper bound, which highlights fundamental differences in the learnability of severe non-stationarity between the Condorcet and Borda regret objectives in dueling bandits. Surprisingly, our techniques for non-stationary Borda dueling bandits also yield improved rates within the Condorcet winner setting, and reveal new preference models where tighter notions of non-stationarity are adaptively learnable. This is accomplished through a novel generalized Borda score framework which unites the Borda and Condorcet problems, thus allowing a reduction of Condorcet regret to a Borda-like task. Such a generalization was not previously known and is likely to be of independent interest.
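For concreteness, the following is a minimal sketch of the standard quantities the abstract refers to, written in illustrative notation (the paper's own notation may differ): $P_t(a,b)$ denotes the probability that arm $a$ is preferred to arm $b$ at round $t$, with $P_t(b,a) = 1 - P_t(a,b)$, and $K$ is the number of arms.

% Borda score of arm a at round t: average probability of beating a uniformly random other arm.
\[
  B_t(a) \;=\; \frac{1}{K-1} \sum_{b \neq a} P_t(a, b).
\]

% A Condorcet winner at round t is an arm $a^{\star}_t$ with $P_t(a^{\star}_t, b) > 1/2$ for all $b \neq a^{\star}_t$;
% it need not exist, whereas the Borda winner $\arg\max_{a} B_t(a)$ always does.

% Dynamic Borda regret of the pair of arms $(a_t, a'_t)$ played at each round, over horizon $T$:
\[
  \mathrm{Reg}_T \;=\; \sum_{t=1}^{T} \Big( \max_{a} B_t(a) \;-\; \tfrac{1}{2}\big(B_t(a_t) + B_t(a'_t)\big) \Big).
\]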