MuseChat: A Conversational Music Recommendation System for Videos

We introduce MuseChat, a dialog-based music recommendation system for videos. Beyond suggesting music tailored to an input video, it lets users interact with the system to refine and personalize their music selections. Previous systems, in contrast, predominantly emphasize content compatibility and often overlook the nuances of users' individual preferences; existing datasets, for example, provide only basic music-video pairings, or such pairings accompanied by textual music descriptions. To address this gap, our research offers three contributions. First, we devise a conversation-synthesis method that simulates a two-turn interaction between a user and a recommendation system, leveraging pre-trained music tags and artist information. In this interaction, the user submits a video to the system, which then suggests a suitable music piece along with a rationale. The user then communicates their musical preferences, and the system presents a refined music recommendation with its reasoning. Second, we introduce a multi-modal recommendation engine that matches music either by aligning it with visual cues from the video alone or by harmonizing visual information, feedback from previously recommended music, and the user's textual input. Third, we bridge music representations and textual data with a Large Language Model (Vicuna-7B). This alignment equips MuseChat to deliver music recommendations and their underlying reasoning in a manner resembling human communication. Our evaluations show that MuseChat surpasses existing state-of-the-art models in music retrieval tasks and pioneers the integration of the recommendation process within a natural language framework.
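The two matching modes of the recommendation engine described above can be illustrated with a minimal retrieval sketch. This is not MuseChat's implementation: it assumes that video, music, and text have already been embedded into a shared vector space by some encoders, and it replaces the learned fusion with a plain average of normalized embeddings. The function names `build_query` and `retrieve` are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def build_query(video_emb, prev_music_emb=None, text_emb=None):
    """Build the retrieval query (hypothetical helper).

    First turn: the query is the video embedding alone.
    Second turn: the query averages the video, the previously recommended
    music, and the user's text. Averaging is an assumption made for
    illustration; a trained model would learn this fusion.
    """
    parts = [l2_normalize(video_emb)]
    if prev_music_emb is not None and text_emb is not None:
        parts += [l2_normalize(prev_music_emb), l2_normalize(text_emb)]
    return l2_normalize(np.mean(parts, axis=0))

def retrieve(query, music_bank, top_k=3):
    """Rank candidate tracks by cosine similarity to the query."""
    scores = l2_normalize(music_bank) @ query
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Toy data: three candidate tracks in a 4-dimensional embedding space.
music_bank = np.eye(4)[:3]
video_emb = np.array([1.0, 0.0, 0.0, 0.0])

# Turn 1: recommend from the video alone.
first_query = build_query(video_emb)
first_ids, first_scores = retrieve(first_query, music_bank)

# Turn 2: the user's text points toward track 2; fold it into the query.
text_emb = np.array([0.0, 0.0, 1.0, 0.0])
second_query = build_query(video_emb, music_bank[first_ids[0]], text_emb)
second_ids, second_scores = retrieve(second_query, music_bank)
```

With plain averaging, the user's text pulls the second-turn query toward track 2 without necessarily displacing the first recommendation. A learned fusion would be needed for the text to steer retrieval as strongly as the abstract describes.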
