Recent progress in text-based Large Language Models (LLMs) and their extended ability to process multi-modal sensory data have led us to explore their applicability to music information retrieval (MIR) challenges. In this paper, we use a systematic prompt engineering approach for LLMs to solve MIR problems. We convert the music data to symbolic inputs and evaluate LLMs' ability to detect annotation errors in three key MIR tasks: beat tracking, chord extraction, and key estimation. A concept augmentation method is proposed to evaluate the consistency of LLMs' music reasoning with the music concepts provided in the prompts. Our experiments tested the MIR capabilities of Generative Pre-trained Transformers (GPT). Results show that GPT achieves an error-detection accuracy of 65.20%, 64.80%, and 59.72% in the beat tracking, chord extraction, and key estimation tasks, respectively, all exceeding the random baseline. Moreover, we observe a positive correlation between GPT's error-detection accuracy and the amount of concept information provided. The current findings based on symbolic music input provide a solid foundation for future LLM-based MIR research.