We consider the task of identifying the Copeland winner(s) in a dueling bandits problem with ternary feedback. This is an underexplored but practically relevant variant of the conventional dueling bandits problem in which, in addition to a strict preference between two arms, one may observe feedback in the form of indifference. We provide a lower bound on the sample complexity of any learning algorithm that finds the Copeland winner(s) with a fixed error probability. Moreover, we propose POCOWISTA, an algorithm whose sample complexity almost matches this lower bound and which shows excellent empirical performance, even for the conventional dueling bandits problem. For the case where the preference probabilities satisfy a specific type of stochastic transitivity, we provide a refined version with an improved worst-case sample complexity.
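To make the target of the identification task concrete, the following minimal sketch computes Copeland scores and winners from known ternary outcome probabilities. It is not POCOWISTA, which must work from sampled duels rather than known probabilities. The data layout, the function names `copeland_scores` and `copeland_winners`, and the scoring convention (the most likely outcome decides a duel; indifference gives each arm half a point) are assumptions made for illustration and may differ from the paper's exact definition.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: for each pair (i, j) with i < j, a tuple
# (p_win, p_tie, p_loss) holding the probabilities that arm i is preferred,
# that the arms are judged indifferent, and that arm j is preferred.
Ternary = Tuple[float, float, float]


def copeland_scores(n_arms: int, probs: Dict[Tuple[int, int], Ternary]) -> List[float]:
    """Return one Copeland-style score per arm under the assumed convention."""
    scores = [0.0] * n_arms
    for (i, j), (p_win, p_tie, p_loss) in probs.items():
        if p_win > max(p_tie, p_loss):
            # Strict preference for arm i is the most likely outcome.
            scores[i] += 1.0
        elif p_loss > max(p_win, p_tie):
            # Strict preference for arm j is the most likely outcome.
            scores[j] += 1.0
        else:
            # Assumption: indifference (or no unique strict mode) splits the point.
            scores[i] += 0.5
            scores[j] += 0.5
    return scores


def copeland_winners(n_arms: int, probs: Dict[Tuple[int, int], Ternary]) -> List[int]:
    """Return all arms attaining the maximal score (there may be several)."""
    scores = copeland_scores(n_arms, probs)
    best = max(scores)
    return [arm for arm, score in enumerate(scores) if score == best]


# Illustrative three-arm instance with made-up probabilities.
example_probs = {
    (0, 1): (0.6, 0.3, 0.1),  # arm 0 most likely beats arm 1
    (0, 2): (0.2, 0.5, 0.3),  # arms 0 and 2 are most likely indifferent
    (1, 2): (0.1, 0.2, 0.7),  # arm 2 most likely beats arm 1
}
print(copeland_scores(3, example_probs))   # [1.5, 0.0, 1.5]
print(copeland_winners(3, example_probs))  # [0, 2]
```

The example has two winners, which is why the task is phrased in terms of Copeland winner(s): unlike a Condorcet winner, a Copeland winner always exists but need not be unique.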