Dueling bandits are widely used to model the preferential feedback that is prevalent in applications such as recommendation systems and ranking. In this paper, we study the Borda regret minimization problem for dueling bandits, which aims to identify the item with the highest Borda score while minimizing the cumulative regret. We propose a rich class of generalized linear dueling bandit models that covers many existing models. We first prove a regret lower bound of order $\Omega(d^{2/3} T^{2/3})$ for the Borda regret minimization problem, where $d$ is the dimension of the contextual vectors and $T$ is the time horizon. To match this lower bound, we propose an explore-then-commit type algorithm for the stochastic setting, which has a nearly matching regret upper bound of $\tilde{O}(d^{2/3} T^{2/3})$. We also propose an EXP3-type algorithm for the adversarial linear setting, where the underlying model parameter can change at each round. This algorithm achieves a regret of $\tilde{O}(d^{2/3} T^{2/3})$, which is also optimal. Empirical evaluations on both synthetic data and a simulated real-world environment corroborate our theoretical analysis.
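As a rough illustration of the quantities named in the abstract, the sketch below computes Borda scores and cumulative Borda regret from a pairwise preference matrix. It assumes the commonly used definitions, which the abstract itself does not spell out: the Borda score of item $i$ is its average probability of beating another item, and comparing the pair $(i_t, j_t)$ incurs regret $2B(i^*) - B(i_t) - B(j_t)$. The function names, the example matrix, and the logistic link with difference features $x_i - x_j$ are hypothetical choices made only for this example, not the paper's implementation.

```python
import math


def borda_scores(p):
    """Borda score of each item: its average probability of beating another item.

    p[i][j] is the probability that item i is preferred over item j.
    """
    k = len(p)
    return [sum(p[i][j] for j in range(k) if j != i) / (k - 1) for i in range(k)]


def borda_regret(p, pairs):
    """Cumulative Borda regret of a sequence of compared pairs (i_t, j_t).

    Assumed per-round regret: 2 * B(best) - B(i_t) - B(j_t).
    """
    b = borda_scores(p)
    best = max(b)
    return sum(2 * best - b[i] - b[j] for i, j in pairs)


def logistic_preferences(items, theta):
    """Hypothetical generalized linear model with a logistic link.

    Assumes p[i][j] = sigmoid(<x_i - x_j, theta>), where items[i] is the
    feature vector x_i. Difference features are an illustrative assumption.
    """
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    k = len(items)
    return [
        [
            sigmoid(sum((a - b) * t for a, b, t in zip(items[i], items[j], theta)))
            for j in range(k)
        ]
        for i in range(k)
    ]


# Made-up preference matrix over three items (illustration only).
p = [
    [0.5, 0.6, 0.9],
    [0.4, 0.5, 0.7],
    [0.1, 0.3, 0.5],
]
scores = borda_scores(p)
winner = max(range(len(p)), key=scores.__getitem__)
regret = borda_regret(p, [(0, 1), (1, 2), (0, 0)])
print(winner, scores, regret)
```

Under these assumed definitions, dueling the Borda winner against itself incurs zero regret, which is why an explore-then-commit strategy can stop accumulating regret once it has committed to the correct item.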