Conversational contextual bandits elicit user preferences by occasionally querying for explicit feedback on key-terms to accelerate learning. However, existing approaches are limited in two respects. First, the information gained from key-term-level conversations and arm-level recommendations is not appropriately combined to speed up learning. Second, asking explorative key-terms is important for quickly eliciting the user's potential interests in various domains and thereby accelerating the convergence of user preference estimation, yet existing work has not considered it. To tackle these issues, we first propose ``ConLinUCB'', a general framework for conversational bandits with better information incorporation, which combines arm-level and key-term-level feedback to estimate the user preference in a single step at each round. Based on this framework, we further design two bandit algorithms with explorative key-term selection strategies, ConLinUCB-BS and ConLinUCB-MCR. We prove tighter regret upper bounds for the proposed algorithms. In particular, ConLinUCB-BS achieves a regret bound of $O(d\sqrt{T\log T})$, improving on the previous result of $O(d\sqrt{T}\log T)$. Extensive experiments on synthetic and real-world data show significant advantages of our algorithms in learning accuracy (up to 54\% improvement) and computational efficiency (up to 72\% improvement) over the classic ConUCB algorithm, indicating their potential benefit to recommender systems.
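To make the size of the improvement concrete, the two bounds can be compared directly. This is a back-of-the-envelope comparison only: it ignores the constants hidden by the $O(\cdot)$ notation, and the value of $T$ and the use of the natural logarithm in the numerical illustration are assumptions chosen for the example, not figures from the experiments.
\[
\frac{d\sqrt{T}\,\log T}{d\sqrt{T\log T}} = \frac{\log T}{\sqrt{\log T}} = \sqrt{\log T}.
\]
The new bound is therefore smaller by a factor of $\sqrt{\log T}$. For instance, with $T = 10^{6}$, $\log T \approx 13.8$ and $\sqrt{\log T} \approx 3.7$.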