Music emotion recognition (MER) aims to identify the emotions conveyed by a given musical piece. However, the public datasets currently available for MER have limited sample sizes. Recently, segment-based methods have been proposed for emotion-related tasks: they train backbone networks on short segments rather than entire audio clips, thereby naturally augmenting the training data without requiring additional resources. The predicted segment-level results are then aggregated to obtain a prediction for the entire song. The most common practice is to let each segment inherit the label of the clip that contains it, but music emotion is not constant throughout a clip, so this introduces label noise and makes training prone to overfitting. To handle the noisy-label issue, we propose a semi-supervised self-learning (SSSL) method, which differentiates between correctly and incorrectly labeled samples in a self-learning manner, thus effectively utilizing the augmented segment-level data. Experiments on three public emotional datasets demonstrate that the proposed method achieves better or comparable performance.
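The segment-based pipeline described above can be illustrated with a minimal sketch. The function names (`make_segments`, `aggregate`) and the fixed-length/hop slicing scheme are illustrative assumptions, not the paper's exact method; the key points shown are label inheritance (each segment takes its clip's label, the source of the noise discussed) and averaging of segment-level predictions into a clip-level prediction.

```python
import numpy as np

def make_segments(clip, clip_label, seg_len, hop):
    """Slice a clip into fixed-length segments (illustrative scheme).

    Each segment inherits the label of its parent clip -- the common
    practice that introduces label noise when emotion varies within a clip.
    """
    starts = range(0, len(clip) - seg_len + 1, hop)
    segments = [clip[s:s + seg_len] for s in starts]
    labels = [clip_label] * len(segments)
    return segments, labels

def aggregate(segment_probs):
    """Average segment-level class probabilities into a clip-level prediction."""
    probs = np.mean(segment_probs, axis=0)
    return int(np.argmax(probs)), probs

# Toy example: a 10-sample "clip", segments of length 4 with hop 2.
clip = np.arange(10.0)
segments, labels = make_segments(clip, clip_label=1, seg_len=4, hop=2)
# One clip now yields 4 training segments, all labeled 1.

# Aggregating hypothetical per-segment probabilities for 2 emotion classes:
pred, probs = aggregate(np.array([[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]]))
```

Here averaging yields class probabilities [0.3, 0.7], so the clip-level prediction is class 1. Other aggregation rules (e.g., majority vote) are equally common; averaging is used here only for concreteness.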