Bandit algorithms for online learning to rank (OLTR) problems often aim to maximize long-term revenue by utilizing user feedback. In practice, however, such algorithms carry a high risk of hurting the user experience because they explore aggressively. Demand for safe exploration has therefore grown in recent years. One approach to safe exploration is to gradually improve an original ranking that is already guaranteed to be of acceptable quality. In this paper, we propose a safe OLTR algorithm that explores efficiently by exchanging one of the items in the current ranking with an item outside the ranking (i.e., an unranked item). We select the unranked item to explore optimistically, based on Kullback-Leibler upper confidence bounds (KL-UCB), and then safely re-rank the items, including the selected one. Through experiments, we demonstrate that the proposed algorithm achieves lower long-term regret than the baselines without any safety violation.
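To make the optimistic selection step concrete, the following is a minimal sketch of a KL-UCB index for Bernoulli (click / no-click) feedback, computed by bisection. It is an illustration under stated assumptions, not the algorithm proposed in the paper: the exploration budget log(t) / n is the textbook choice, the function names (`bernoulli_kl`, `kl_ucb_index`, `pick_unranked_item`) and the click/view counters are hypothetical, and the safe re-ranking step is not shown because the abstract does not specify it.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_index(mean, pulls, t, tol=1e-6):
    """Largest q >= mean with pulls * KL(mean, q) <= log(t), found by bisection.

    Assumption: the exploration budget is log(t) / pulls; the paper may use
    a different term.
    """
    if pulls == 0:
        return 1.0  # an item never shown gets the most optimistic value
    budget = math.log(max(t, 1)) / pulls
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid  # mid is still plausible, so move the lower end up
        else:
            hi = mid  # mid is too optimistic, so move the upper end down
    return lo


def pick_unranked_item(clicks, views, t):
    """Return the unranked item with the highest KL-UCB index (hypothetical helper)."""
    def index(item):
        n = views[item]
        mean = clicks[item] / n if n else 0.0
        return kl_ucb_index(mean, n, t)
    return max(clicks, key=index)
```

For example, `pick_unranked_item({"a": 5, "b": 0}, {"a": 10, "b": 0}, t=100)` selects `"b"`, because an item that has never been shown receives the maximal index. With equal empirical click rates, the item with fewer observations gets the larger index, which is what drives exploration toward poorly estimated items.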