Conversational contextual bandits elicit user preferences by occasionally querying for explicit feedback on key-terms to accelerate learning. However, existing approaches have two limitations. First, the information gained from key-term-level conversations and arm-level recommendations is not appropriately incorporated to speed up learning. Second, asking explorative key-terms is important for quickly eliciting the user's potential interests in various domains and thereby accelerating the convergence of user preference estimation, yet this has not been considered in existing work. To tackle these issues, we first propose ``ConLinUCB'', a general framework for conversational bandits with better information incorporation, which combines arm-level and key-term-level feedback to estimate user preference in a single step at each time step. Based on this framework, we further design two bandit algorithms with explorative key-term selection strategies, ConLinUCB-BS and ConLinUCB-MCR. We prove tighter regret upper bounds for our proposed algorithms. In particular, ConLinUCB-BS achieves a regret bound of $O(\sqrt{dT\log T})$, improving on the previous result of $O(d\sqrt{T}\log T)$ by a factor of $\sqrt{d\log T}$. Extensive experiments on synthetic and real-world data show significant advantages of our algorithms in learning accuracy (up to 54\% improvement) and computational efficiency (up to 72\% improvement) compared to the classic ConUCB algorithm, showing the potential benefit to recommender systems.
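To make the idea of estimating user preference in one step from both feedback types concrete, the following is a minimal sketch and not the authors' implementation. It assumes a LinUCB-style ridge regression in which arm-level and key-term-level observations update a single shared Gram matrix and response vector. The class name `JointFeedbackLinUCB`, the constant exploration weight `alpha`, and the regularizer `reg` are hypothetical choices made for illustration, and the explorative key-term selection strategies (BS and MCR) are not modeled here.

```python
import numpy as np


class JointFeedbackLinUCB:
    """Hypothetical sketch: one ridge estimate shared by both feedback types."""

    def __init__(self, dim, alpha=1.0, reg=1.0):
        self.alpha = alpha              # exploration weight (assumed constant)
        self.M = reg * np.eye(dim)      # regularized Gram matrix
        self.b = np.zeros(dim)          # feedback-weighted sum of feature vectors

    def update(self, x, feedback):
        """Fold one observation (arm-level or key-term-level) into the shared statistics."""
        x = np.asarray(x, dtype=float)
        self.M += np.outer(x, x)
        self.b += feedback * x

    def estimate(self):
        """Single-step preference estimate: solve M @ theta = b."""
        return np.linalg.solve(self.M, self.b)

    def select_arm(self, arms):
        """Return the index of the arm with the largest upper confidence bound."""
        arms = np.asarray(arms, dtype=float)
        theta = self.estimate()
        M_inv = np.linalg.inv(self.M)
        # Per-arm confidence width sqrt(x^T M^{-1} x)
        bonus = np.sqrt(np.einsum("ij,jk,ik->i", arms, M_inv, arms))
        return int(np.argmax(arms @ theta + self.alpha * bonus))
```

In this sketch, a key-term response and an arm reward both go through the same `update` call, so a single linear solve per round yields the estimate; this is one plausible reading of combining the two feedback levels, under the assumptions stated above.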