Dueling bandits are widely used to model preferential feedback that is prevalent in machine learning applications such as recommendation systems and ranking. In this paper, we study the Borda regret minimization problem for dueling bandits, which aims to identify the item with the highest Borda score while minimizing the cumulative regret. We propose a new and highly expressive generalized linear dueling bandits model, which covers many existing models. Surprisingly, the Borda regret minimization problem turns out to be difficult, as we prove a regret lower bound of order $\Omega(d^{2/3} T^{2/3})$, where $d$ is the dimension of contextual vectors and $T$ is the time horizon. To attain the lower bound, we propose an explore-then-commit type algorithm, which has a nearly matching regret upper bound $\tilde{O}(d^{2/3} T^{2/3})$. When the number of items/arms $K$ is small, our algorithm can achieve a smaller regret $\tilde{O}( (d \log K)^{1/3} T^{2/3})$ with proper choices of hyperparameters. We also conduct empirical experiments on both synthetic data and a simulated real-world environment, which corroborate our theoretical analysis.
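To make the notions of Borda score and explore-then-commit concrete, the following is a minimal sketch for a non-contextual, $K$-armed simplification. It is not the generalized linear algorithm proposed in the paper. The names `borda_scores` and `explore_then_commit`, the preference matrix `P`, the uniform-pair exploration scheme, and the Borda normalization by $K-1$ (average win probability against the other items) are all assumptions made for illustration. The per-round regret is taken to be $2B(i^*) - B(i_t) - B(j_t)$, which is a common convention and is likewise assumed here.

```python
import numpy as np

def borda_scores(P):
    """Borda score of item i: mean probability that i beats another item.

    P[i, j] is the (assumed known, for simulation only) probability that
    item i wins a duel against item j. The diagonal is ignored.
    """
    K = P.shape[0]
    return (P.sum(axis=1) - np.diag(P)) / (K - 1)

def explore_then_commit(P, horizon, n_explore, rng):
    """Illustrative explore-then-commit for a K-armed dueling bandit.

    Exploration: duel uniformly random pairs of distinct items and record wins.
    Commit: play the item with the highest empirical win rate against itself.
    Returns the committed item and the cumulative Borda regret.
    """
    K = P.shape[0]
    true_b = borda_scores(P)      # used only to measure regret
    best = true_b.max()
    wins = np.zeros(K)
    plays = np.zeros(K)
    regret = 0.0
    committed = None

    for t in range(horizon):
        if t < n_explore:
            # Uniform random pair: each item's opponent is uniform over the
            # other items, so its win rate estimates its Borda score.
            i, j = rng.choice(K, size=2, replace=False)
            winner = i if rng.random() < P[i, j] else j
            wins[winner] += 1
            plays[i] += 1
            plays[j] += 1
        else:
            if committed is None:
                committed = int(np.argmax(wins / np.maximum(plays, 1)))
            i = j = committed
        regret += 2 * best - true_b[i] - true_b[j]

    if committed is None:  # horizon ended before the commit phase began
        committed = int(np.argmax(wins / np.maximum(plays, 1)))
    return committed, regret

# Hypothetical 3-item preference matrix; item 0 has the highest Borda score.
P = np.array([[0.5, 0.9, 0.9],
              [0.1, 0.5, 0.6],
              [0.1, 0.4, 0.5]])
committed, regret = explore_then_commit(
    P, horizon=2000, n_explore=300, rng=np.random.default_rng(0)
)
```

In this sketch the exploration length `n_explore` is the hyperparameter that trades exploration cost against the risk of committing to a suboptimal item. Setting it on the order of $T^{2/3}$ is what produces $T^{2/3}$-type regret for explore-then-commit schemes in general.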