Automatic singing voice understanding tasks, such as singer identification, singing voice transcription, and singing technique classification, benefit from data-driven approaches based on deep learning. Owing to their representational power, these approaches remain effective across highly diverse vocal styles and noisy samples. However, the limited availability of labeled data remains a significant obstacle to achieving satisfactory performance. In recent years, self-supervised learning models (SSL models) have been trained on large amounts of unlabeled data in the fields of speech processing and music classification. By fine-tuning these models for the target tasks, performance comparable to conventional supervised learning can be achieved with limited training data. Therefore, in this paper, we investigate the effectiveness of SSL models for various singing voice recognition tasks. As an initial exploration, we report experiments comparing SSL models on three tasks (singer identification, singing voice transcription, and singing technique classification) and discuss the findings. Experimental results show that each SSL model performs comparably to, and sometimes outperforms, state-of-the-art methods on each task. We also conduct a layer-wise analysis to further understand the behavior of the SSL models.