Conversational recommender systems (CRS) explicitly solicit users' preferences on the fly to improve recommendations. Most existing CRS solutions rely on a single policy trained by reinforcement learning for a population of users. For users new to the system, however, such a global policy fails to satisfy them; this is the cold-start challenge. In this paper, we study CRS policy learning for cold-start users via meta-reinforcement learning. We propose to learn a meta policy and adapt it to each new user with only a few trials of conversational recommendation. To facilitate fast policy adaptation, we design three synergistic components. First, we design a meta-exploration policy dedicated to identifying user preferences through a few exploratory conversations, which accelerates the adaptation of the meta policy into a personalized policy. Second, we adapt the item recommendation module to each user to maximize recommendation quality based on the conversation states collected during the conversations. Third, we propose a Transformer-based state encoder as the backbone that connects the previous two components. It provides comprehensive state representations by modeling the complicated relations between positive and negative feedback during the conversation. Extensive experiments on three datasets demonstrate the advantage of our solution in serving new users over a rich set of state-of-the-art CRS solutions.
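To make the idea of adapting a meta policy to a new user in a few trials concrete, the following is a toy sketch and not the method proposed in the paper. It assumes a policy that is just a vector of logits over attributes to ask about, a reward of +1 or -1 for accepted or rejected questions, and a REINFORCE-style update; the names `softmax`, `adapt_policy`, `meta_logits`, and `user_likes` are hypothetical. Meta-training (learning the initial logits across many users), the meta-exploration policy, and the Transformer state encoder are all omitted.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adapt_policy(meta_logits, user_likes, n_trials=5, lr=1.0, seed=0):
    """Adapt a copy of the meta policy's logits to one new user.

    Assumption: each trial asks about one sampled attribute and the
    simulated user answers yes (+1) if it is in `user_likes`, else no (-1).
    """
    rng = random.Random(seed)
    logits = list(meta_logits)  # leave the shared meta policy untouched
    for _ in range(n_trials):
        probs = softmax(logits)
        asked = rng.choices(range(len(logits)), weights=probs)[0]
        reward = 1.0 if asked in user_likes else -1.0
        # REINFORCE: d log pi(asked) / d logit_i = 1[i == asked] - probs[i]
        for i in range(len(logits)):
            grad = (1.0 if i == asked else 0.0) - probs[i]
            logits[i] += lr * reward * grad
    return logits


if __name__ == "__main__":
    meta = [0.0, 0.0, 0.0, 0.0]  # uninformative meta policy (assumed)
    adapted = adapt_policy(meta, user_likes={2}, n_trials=5)
    print("before:", softmax(meta))
    print("after: ", softmax(adapted))
```

In this sketch, both accepted and rejected questions shift probability toward the attribute the simulated user likes, which illustrates why modeling positive and negative feedback together can matter when only a few conversations are available.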