Self-supervised learning (SSL) has shown promising results in various speech and natural language processing applications. However, its efficacy in music information retrieval (MIR) remains largely unexplored. While earlier SSL models pre-trained on music recordings appear to have been mostly closed-source, recent speech models such as wav2vec 2.0 have shown promise in music modelling. Nevertheless, research on how well speech SSL models transfer to music recordings has been limited. We explore the adaptation of SSL to music with two distinct speech-related models, data2vec 1.0 and HuBERT, and refer to the adapted models as music2vec and musicHuBERT, respectively. We train 12 SSL models with 95M parameters under various pre-training configurations and systematically evaluate their performance on 13 different MIR tasks. Our findings suggest that training on music data generally improves performance on MIR tasks, even when models are trained using paradigms designed for speech. However, we identify limitations of these speech-oriented designs, especially in modelling polyphonic information. Based on the experimental results, we also offer empirical suggestions for designing future musical SSL strategies and paradigms.