Extracting the predominant pitch from polyphonic audio is a fundamental task in music information retrieval and computational musicology. Accomplishing this task with machine learning requires a large amount of labeled audio data to train the model. However, a classical model pre-trained on data from one domain (source), e.g., songs of a particular singer or genre, may not perform comparably well when extracting melody from other domains (target). The performance of such models can be boosted by adapting them with very little annotated data from the target domain. In this work, we propose an efficient interactive melody adaptation method. Our method selects the regions in the target audio that require human annotation using a confidence criterion based on normalized true class probability. The model then uses these annotations to adapt itself to the target domain via meta-learning. Our method also provides a novel meta-learning approach that handles class imbalance, i.e., the case where only a few representative samples from only a few classes are available for adaptation in the target domain. Experimental results show that the proposed method outperforms other adaptive melody extraction baselines. The proposed method is model-agnostic and hence can be applied to other non-adaptive melody extraction models to boost their performance. We also release a Hindustani Alankaar and Raga (HAR) dataset containing 523 audio files, totaling about 6.86 hours, intended for singing melody extraction tasks.
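The region-selection step can be illustrated with a minimal sketch. The abstract does not define the confidence criterion precisely, so the code below makes two assumptions: that normalized true class probability is the probability assigned to the true pitch class divided by the maximum class probability in that frame, and that regions are chosen by a simple threshold on per-frame confidence. The function names, the threshold, and the `min_frames` parameter are hypothetical and not taken from the paper. Since the true class is unknown at inference time, a real system would have to estimate this confidence; here it is simply given as input.

```python
import numpy as np


def normalized_tcp(probs, true_idx):
    """Assumed definition: p(true class) / max class probability, per frame.

    probs: array of shape (frames, classes) with class probabilities.
    true_idx: array of shape (frames,) with the true class index per frame.
    """
    probs = np.asarray(probs, dtype=float)
    true_idx = np.asarray(true_idx)
    p_true = probs[np.arange(len(probs)), true_idx]
    return p_true / probs.max(axis=1)


def low_confidence_regions(confidence, threshold=0.5, min_frames=1):
    """Return (start, end) frame pairs, end exclusive, where confidence < threshold.

    Hypothetical selection rule: contiguous runs of low-confidence frames
    that are at least min_frames long are proposed for human annotation.
    """
    low = np.asarray(confidence) < threshold
    # Pad with False so runs touching either edge still produce both boundaries.
    padded = np.concatenate(([False], low, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[::2], changes[1::2]
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= min_frames]
```

For example, `low_confidence_regions([0.9, 0.2, 0.3, 0.8, 0.1], threshold=0.5)` proposes frames 1 to 2 and frame 4 for annotation. Raising `min_frames` drops very short runs, which may not be worth an annotator's attention.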