Dueling bandits is a prominent framework for decision-making involving preferential feedback, a valuable feature that fits various applications involving human interaction, such as ranking, information retrieval, and recommendation systems. While substantial efforts have been made to minimize the cumulative regret in dueling bandits, a notable gap in the current research is the absence of regret bounds that account for the inherent uncertainty in pairwise comparisons between the dueling arms. Intuitively, greater uncertainty suggests a higher level of difficulty in the problem. To bridge this gap, this paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM). We propose a new SupLinUCB-type algorithm that is computationally efficient and achieves a variance-aware regret bound $\tilde O\big(d\sqrt{\sum_{t=1}^T\sigma_t^2} + d\big)$, where $\sigma_t^2$ is the variance of the pairwise comparison in round $t$, $d$ is the dimension of the context vectors, and $T$ is the time horizon. Our regret bound aligns with the intuitive expectation: in scenarios where the comparison is deterministic, so that $\sigma_t^2 = 0$ in every round, the algorithm suffers only $\tilde O(d)$ regret. We perform empirical experiments on synthetic data to confirm the advantage of our method over previous variance-agnostic algorithms.
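To make the shape of the bound concrete, the following minimal sketch computes the per-round comparison variance and the leading term $d\sqrt{\sum_t \sigma_t^2} + d$, with constants and logarithmic factors dropped. It is an illustration only, not the proposed algorithm. Several choices here are assumptions not stated in the abstract: the GLM link is taken to be logistic, the comparison probability depends on the feature difference of the two arms, and the names `comparison_variance` and `bound_terms` are hypothetical. The variance-agnostic reference value $d\sqrt{T}$ is the usual worst-case rate, included only for comparison.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def comparison_variance(theta, x, y):
    """Variance p(1 - p) of the binary outcome "x beats y".

    Assumption: p = sigmoid(<theta, x - y>), i.e. a logistic link applied
    to the feature difference of the two dueling arms.
    """
    gap = sum(t * (a - b) for t, a, b in zip(theta, x, y))
    p = sigmoid(gap)
    return p * (1.0 - p)


def bound_terms(variances, d):
    """Leading terms only; constants and log factors are ignored."""
    variance_aware = d * math.sqrt(sum(variances)) + d
    variance_agnostic = d * math.sqrt(len(variances))
    return variance_aware, variance_agnostic


if __name__ == "__main__":
    rng = random.Random(0)
    d, T = 5, 1000
    # A larger scale makes comparisons closer to deterministic.
    for scale in (0.1, 1.0, 10.0):
        theta = [scale] * d
        variances = []
        for _ in range(T):
            x = [rng.uniform(-1, 1) for _ in range(d)]
            y = [rng.uniform(-1, 1) for _ in range(d)]
            variances.append(comparison_variance(theta, x, y))
        aware, agnostic = bound_terms(variances, d)
        print(f"scale={scale}: variance-aware={aware:.1f}, agnostic={agnostic:.1f}")
```

Since a Bernoulli variance is at most 1/4, the variance-aware term never exceeds the agnostic one by more than the additive $d$, and it collapses to $d$ when every comparison is deterministic.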