Music recommendation for videos is attracting growing interest in multi-modal research. However, existing systems focus primarily on content compatibility and often ignore users' preferences. Because they cannot interact with users to refine their suggestions or explain them, they deliver a less satisfying experience. We address these issues with MuseChat, a first-of-its-kind dialogue-based recommendation system that personalizes music suggestions for videos. Our system provides two key functionalities, each with an associated module: recommendation and reasoning. The recommendation module takes a video as input, along with optional information such as previously suggested music and the user's preferences, and retrieves a music track that matches the context. The reasoning module, built on a large language model (Vicuna-7B) extended to accept multi-modal inputs, provides a reasonable explanation for the recommended music. To evaluate the effectiveness of MuseChat, we build a large-scale dataset, conversational music recommendation for videos, that simulates a two-turn interaction between a user and a recommender based on accurate music track information. Experimental results show that MuseChat achieves significant improvements over existing video-based music retrieval methods and offers strong interpretability and interactability.