Instrument playing techniques (IPTs) are a pivotal component of musical expression. However, the development of automatic IPT detection methods is hindered by limited labeled data and inherent class imbalance. In this paper, we propose to take a self-supervised learning model pre-trained on large-scale unlabeled music data and fine-tune it on IPT detection tasks, an approach that addresses both data scarcity and class imbalance. Recognizing the significance of pitch in capturing the nuances of IPTs and the importance of onsets in locating IPT events, we investigate multi-task fine-tuning with pitch and onset detection as auxiliary tasks. Additionally, we apply a post-processing approach for event-level prediction, in which an IPT activation initiates an event only if the onset output confirms an onset in that frame. Our method outperforms prior approaches on both frame-level and event-level metrics across multiple IPT benchmark datasets. Further experiments demonstrate the efficacy of multi-task fine-tuning for each IPT class.
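The onset-gated post-processing rule can be illustrated with a minimal sketch. It assumes frame-wise IPT activations (one probability per class per frame) and a single frame-wise onset probability, both binarized with a hypothetical threshold of 0.5. The function name `decode_events`, the thresholds, and the `(class, start_frame, end_frame)` output format are illustrative assumptions, not the authors' implementation. Here an active frame opens an event only when the onset head also fires in that frame, and the event stays open until the activation drops.

```python
from typing import List, Optional, Sequence, Tuple

# (class index, start frame, end frame exclusive); hypothetical output format
Event = Tuple[int, int, int]


def decode_events(
    ipt_probs: Sequence[Sequence[float]],
    onset_probs: Sequence[float],
    ipt_threshold: float = 0.5,    # assumed value, not from the paper
    onset_threshold: float = 0.5,  # assumed value, not from the paper
) -> List[Event]:
    """Hypothetical sketch of onset-gated event decoding.

    ipt_probs[t][k] is the activation of IPT class k at frame t;
    onset_probs[t] is the onset-head output at frame t.
    """
    n_frames = len(ipt_probs)
    if len(onset_probs) != n_frames:
        raise ValueError("ipt_probs and onset_probs must have the same number of frames")
    n_classes = len(ipt_probs[0]) if n_frames else 0

    events: List[Event] = []
    for k in range(n_classes):
        start: Optional[int] = None  # start frame of the currently open event
        for t in range(n_frames):
            active = ipt_probs[t][k] >= ipt_threshold
            if start is None:
                # An active frame initiates an event only if an onset is confirmed here.
                if active and onset_probs[t] >= onset_threshold:
                    start = t
            elif not active:
                # Activation dropped: close the open event (end frame is exclusive).
                events.append((k, start, t))
                start = None
        if start is not None:
            # Event still open at the last frame: close it at the end of the clip.
            events.append((k, start, n_frames))

    # Order events by start frame, then by class index.
    events.sort(key=lambda e: (e[1], e[0]))
    return events


if __name__ == "__main__":
    ipt = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.8], [0.1, 0.9], [0.7, 0.1]]
    onset = [0.9, 0.2, 0.1, 0.6, 0.1]
    print(decode_events(ipt, onset))  # [(0, 0, 2), (1, 3, 4)]
```

In the example, class 1 is active at frames 1 and 2, but no onset is confirmed there, so no event starts until frame 3. Likewise, the class 0 activation at frame 4 is discarded. A further simplifying assumption of this sketch is that an onset occurring inside an already open event does not split it into two events.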