Conversational recommendation has advanced rapidly with large language models (LLMs), yet music remains a uniquely challenging domain: effective recommendations require reasoning over audio content beyond what text or metadata can capture. We present MusiCRS, the first benchmark for audio-centric conversational recommendation, linking authentic user conversations from Reddit with their corresponding tracks. MusiCRS comprises 477 high-quality conversations spanning diverse genres (classical, hip-hop, electronic, metal, pop, indie, jazz), covering 3,589 unique musical entities with audio grounding via YouTube links. MusiCRS supports evaluation under three input modality configurations (audio-only, query-only, and audio+query), enabling systematic comparison of audio-LLMs, retrieval models, and traditional approaches. Our experiments reveal that current systems struggle to integrate information across modalities, with the best performance frequently achieved in single-modality settings rather than multimodal configurations. This exposes a fundamental limitation in cross-modal knowledge integration: models excel at dialogue semantics but falter when grounding abstract musical concepts in audio. To facilitate progress, we release the MusiCRS dataset (https://huggingface.co/datasets/rohan2810/MusiCRS), evaluation code (https://github.com/rohan2810/musiCRS), and comprehensive baselines.
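For orientation, a minimal sketch of loading the dataset with the Hugging Face `datasets` library follows. The repository path comes from the link above; the split name, the field inspection, and the configuration labels are illustrative assumptions, so consult the dataset card for the actual schema.

```python
# Minimal sketch: load MusiCRS and inspect one conversation.
# Assumption: the split name "train" and the fields printed below are
# illustrative -- check the dataset card for the real schema.
from datasets import load_dataset

ds = load_dataset("rohan2810/MusiCRS", split="train")
print(len(ds))       # expected: on the order of 477 conversations
print(ds[0].keys())  # inspect the actual fields (dialogue, genre, links, ...)

# The benchmark evaluates under three input modality configurations;
# these identifiers are hypothetical labels, not dataset keys.
MODALITY_CONFIGS = ["audio_only", "query_only", "audio_plus_query"]
```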