With the growing amount of available musical data, automatic instrument recognition, one of the essential problems in Music Information Retrieval (MIR), is drawing increasing attention. While automatic recognition of single instruments has been well studied, recognition in polyphonic, multi-instrument recordings remains challenging. This work presents our efforts toward building a robust end-to-end instrument recognition system for polyphonic multi-instrument music. We train our model with a pre-training and fine-tuning approach: we pre-train on a large amount of monophonic musical data and subsequently fine-tune the model for polyphonic ensemble music. In pre-training, we apply data augmentation techniques to reduce the domain gap between monophonic musical data and real-world music. We evaluate our method on the IRMAS testing data, a polyphonic dataset comprising professionally produced commercial music recordings. Experimental results show that our best model achieves a micro F1-score of 0.674 and an LRAP of 0.814, relative improvements of 10.9% and 8.9%, respectively, over the previous state-of-the-art end-to-end approach. We also build a lightweight model that achieves competitive performance with only 519K trainable parameters.
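The two metrics reported above, micro F1 and LRAP (label ranking average precision), can be illustrated with a minimal sketch for multi-label instrument tags. This is not the authors' evaluation code: the function names `micro_f1` and `lrap` and the toy labels and scores are hypothetical and are not taken from IRMAS. It assumes binary ground-truth labels per clip, binary predictions for micro F1, and real-valued scores for LRAP.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all clips and all labels."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if t and p:
                tp += 1
            elif p:          # predicted but not present
                fp += 1
            elif t:          # present but not predicted
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def lrap(y_true, y_score):
    """Label ranking average precision over clips.

    For each true label, take the fraction of labels ranked at or above
    it that are also true, then average over true labels and over clips.
    """
    total = 0.0
    for t_row, s_row in zip(y_true, y_score):
        relevant = [j for j, t in enumerate(t_row) if t]
        # Convention assumed here: clips with no true labels, or with
        # all labels true, contribute a score of 1.
        if not relevant or len(relevant) == len(t_row):
            total += 1.0
            continue
        acc = 0.0
        for j in relevant:
            rank = sum(1 for s in s_row if s >= s_row[j])
            hits = sum(1 for k in relevant if s_row[k] >= s_row[j])
            acc += hits / rank
        total += acc / len(relevant)
    return total / len(y_true)


# Toy data (hypothetical): 2 clips, 3 instrument classes.
toy_true = [[1, 0, 1], [0, 1, 0]]
toy_pred = [[1, 0, 0], [0, 1, 1]]          # e.g. scores thresholded at 0.5
rank_true = [[1, 0, 0], [0, 0, 1]]
rank_score = [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]

f1_value = micro_f1(toy_true, toy_pred)
lrap_value = lrap(rank_true, rank_score)
print(f"micro F1 = {f1_value:.3f}, LRAP = {lrap_value:.3f}")
```

Micro F1 depends on the decision threshold used to binarize the scores, whereas LRAP is threshold-free and only looks at how the true instruments are ranked, which is one reason the two are commonly reported together.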