MuseChat: A Conversational Music Recommendation System for Videos

We introduce MuseChat, a dialog-based music recommendation system for videos. It suggests music tailored to an input video and lets users refine and personalize the selection through conversation. Previous systems predominantly emphasized content compatibility and often overlooked the nuances of users' individual preferences; for example, existing datasets provide only basic music-video pairings, or such pairings with textual music descriptions. To address this gap, our research offers three contributions. First, we devise a conversation-synthesis method that leverages pre-trained music tags and artist information to simulate a two-turn interaction between a user and a recommendation system. In the first turn, the user submits a video and the system suggests a suitable music piece with a rationale. In the second turn, the user communicates their musical preferences and the system presents a refined recommendation with its reasoning. Second, we introduce a multi-modal recommendation engine that matches music either by aligning it with visual cues from the video alone, or by combining visual information, feedback from the previously recommended music, and the user's textual input. Third, we bridge music representations and textual data with a Large Language Model (Vicuna-7B). This alignment equips MuseChat to deliver music recommendations and their underlying reasoning in a manner resembling human communication. Our evaluations show that MuseChat surpasses existing state-of-the-art models in music retrieval tasks and pioneers the integration of the recommendation process within a natural language framework.
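The two retrieval modes described above (video only in the first turn; video, previous music, and user text in the second) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `build_query` and `rank_tracks`, the plain averaging of embeddings, and the toy vectors are hypothetical and do not reproduce MuseChat's learned recommendation engine.

```python
import math
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_query(
    video: Vector,
    prev_music: Optional[Vector] = None,
    user_text: Optional[Vector] = None,
) -> List[float]:
    """Build a query embedding for retrieval.

    Turn 1: the video embedding alone.
    Turn 2: an unweighted average of video, previous-music, and text embeddings.
    Averaging is a stand-in assumption, not the fusion used in the paper.
    """
    parts = [video]
    if prev_music is not None and user_text is not None:
        parts += [prev_music, user_text]
    return [sum(vals) / len(parts) for vals in zip(*parts)]


def rank_tracks(query: Vector, library: Dict[str, Vector]) -> List[str]:
    """Return track names sorted from most to least similar to the query."""
    return sorted(library, key=lambda name: cosine(query, library[name]), reverse=True)


if __name__ == "__main__":
    # Hypothetical 2-D embeddings; real systems use learned, high-dimensional ones.
    library = {
        "calm_piano": [1.0, 0.0],
        "upbeat_pop": [0.0, 1.0],
        "acoustic_folk": [0.7, 0.7],
    }
    video = [0.9, 0.2]

    # Turn 1: recommend from the video alone.
    first = rank_tracks(build_query(video), library)
    print("turn 1:", first)

    # Turn 2: refine using the previous pick and the user's text embedding.
    second = rank_tracks(
        build_query(video, prev_music=library[first[0]], user_text=[0.0, 1.0]),
        library,
    )
    print("turn 2:", second)
```

In this toy setup the first turn ranks `calm_piano` highest, and the second turn shifts the top result to `acoustic_folk` once the user's text embedding pulls the query toward the second axis. A learned fusion model would replace the averaging step.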
