Recent advances in large language models (LLMs) have transformed open-domain question answering, yet their effectiveness in music-related reasoning remains limited due to sparse music knowledge in pretraining data. While music information retrieval and computational musicology have explored structured and multimodal understanding, few resources support factual and contextual music question answering (MQA) grounded in artist metadata or historical context. We introduce MusWikiDB, a vector database of 3.2M passages from 144K music-related Wikipedia pages, and ArtistMus, a benchmark of 1,000 questions on 500 diverse artists with metadata such as genre, debut year, and topic. These resources enable systematic evaluation of retrieval-augmented generation (RAG) for MQA. Experiments show that RAG markedly improves factual accuracy; open-source models gain up to +56.8 percentage points (for example, Qwen3 8B improves from 35.0 to 91.8), approaching proprietary model performance. RAG-style fine-tuning further boosts both factual recall and contextual reasoning, improving results on both in-domain and out-of-domain benchmarks. MusWikiDB also yields approximately 6 percentage points higher accuracy and 40% faster retrieval than a general-purpose Wikipedia corpus. We release MusWikiDB and ArtistMus to advance research in music information retrieval and domain-specific question answering, establishing a foundation for retrieval-augmented reasoning in culturally rich domains such as music.