In music information retrieval (MIR) research, the use of pretrained foundational audio encoders (FAEs) has recently become a trend. FAEs pretrained on large amounts of music and audio data have been shown to improve performance on MIR tasks such as music tagging and automatic music transcription. However, their use for music structure analysis (MSA) remains underexplored: only a small subset of FAEs has been examined for MSA, and the impact of factors such as learning methods, training data, and model context length on MSA performance remains unclear. In this study, we conduct comprehensive experiments on 11 types of FAEs to investigate how these factors affect MSA performance. Our results demonstrate that FAEs trained with self-supervised masked language modeling on music data are particularly effective for MSA. These findings pave the way for future research on FAEs and MSA.