Towards Building Voice-based Conversational Recommender Systems: Datasets, Potential Solutions, and Prospects

Conversational recommender systems (CRSs) have become a crucial emerging research topic in the field of recommender systems (RSs), thanks to their natural advantages of explicitly acquiring user preferences via interactive conversations and revealing the reasons behind recommendations. However, most current CRSs are text-based, which makes them less user-friendly and may pose challenges for certain users, such as those with visual impairments or limited reading and writing abilities. Therefore, for the first time, this paper investigates the potential of voice-based CRSs (VCRSs) to revolutionize the way users interact with RSs in a natural, intuitive, convenient, and accessible fashion. To support such studies, we create two VCRS benchmark datasets in the e-commerce and movie domains, after an exhaustive literature review revealed the lack of such datasets. Specifically, we first empirically verify the benefits and necessity of creating such datasets. We then convert user-item interactions into text-based conversations, using ChatGPT-driven prompts to generate diverse and natural templates, and synthesize the corresponding audio with a text-to-speech model. In addition, a number of strategies are carefully designed to ensure the naturalness and high quality of the voice conversations. On this basis, we further explore potential solutions and point out possible directions for building end-to-end VCRSs that seamlessly extract and integrate voice-based inputs, thus delivering performance-enhanced, self-explainable, and user-friendly VCRSs. Our study aims to establish a foundation and motivate further pioneering research in the emerging field of VCRSs. This aligns with the principles of explainable AI and AI for social good, viz., utilizing technology's potential to create a fair, sustainable, and just world.
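
As a rough illustration of the conversion step described above, the following minimal sketch fills sentence templates with the fields of one user-item interaction to produce a two-turn text conversation. The templates, the field names `category` and `title`, and the function `interaction_to_dialogue` are hypothetical and chosen only for illustration; the abstract states that the actual templates are generated via ChatGPT-driven prompts and does not specify the data schema. In such a pipeline, the resulting user turns would then be passed to a text-to-speech model, which is not shown here.

```python
import random

# Hypothetical templates for illustration only; the paper obtains its
# templates from ChatGPT-driven prompts rather than a hand-written list.
USER_TEMPLATES = [
    "Hi, I'm looking for a {category}. Any suggestions?",
    "Could you recommend a good {category} for me?",
]
AGENT_TEMPLATES = [
    "Sure, you might like {title}.",
    "Based on what you said, I'd suggest {title}.",
]


def interaction_to_dialogue(interaction, rng):
    """Turn one user-item interaction into a two-turn text conversation.

    `interaction` is assumed to be a dict with the keys "category" and
    "title" (an assumed schema, not taken from the paper).
    """
    user_turn = rng.choice(USER_TEMPLATES).format(category=interaction["category"])
    agent_turn = rng.choice(AGENT_TEMPLATES).format(title=interaction["title"])
    return [("user", user_turn), ("agent", agent_turn)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the template choice is repeatable
    example = {"category": "sci-fi movie", "title": "Arrival"}
    for speaker, text in interaction_to_dialogue(example, rng):
        print(f"{speaker}: {text}")
```

Sampling from several templates per turn is one simple way to obtain the diversity the abstract mentions; passing an explicit `random.Random` instance keeps the generated dataset reproducible.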
