Automatic singing voice understanding tasks, such as singer identification, singing voice transcription, and singing technique classification, benefit from data-driven approaches based on deep learning. Owing to their representational capacity, these approaches work well even across the rich diversity of vocal and noisy samples. However, the limited availability of labeled data remains a significant obstacle to achieving satisfactory performance. In recent years, self-supervised learning (SSL) models have been trained on large amounts of unlabeled data in the fields of speech processing and music classification. Fine-tuning these models on a target task can achieve performance comparable to conventional supervised learning with only limited training data. In this paper, we therefore investigate the effectiveness of SSL models for various singing voice recognition tasks. As an initial exploration, we report and discuss experiments comparing SSL models on three tasks: singer identification, singing voice transcription, and singing technique classification. Experimental results show that each SSL model achieves performance comparable to, and sometimes better than, state-of-the-art methods on each task. We also conduct a layer-wise analysis to further understand the behavior of the SSL models.