Deep learning models trained in a supervised setting have revolutionized audio and speech processing. However, their performance inherently depends on the quantity of human-annotated data, making them costly to scale and prone to poor generalization under unseen conditions. To address these challenges, Self-Supervised Learning (SSL) has emerged as a promising paradigm that leverages vast amounts of unlabeled data to learn relevant representations. The application of SSL to Automatic Speech Recognition (ASR) has been extensively studied, but research on other downstream tasks, notably Speaker Recognition (SR), remains in its early stages. This work describes the major SSL instance-invariance frameworks (e.g., SimCLR, MoCo, and DINO), initially developed for computer vision, along with their adaptation to SR. Various SSL methods for SR proposed in the literature and built upon these frameworks are also presented. An extensive review of these approaches is then conducted: (1) the effect of the main hyperparameters of SSL frameworks is investigated; (2) the role of individual SSL components is studied (e.g., data augmentation, projector, positive sampling); and (3) SSL frameworks are evaluated on SR with in-domain and out-of-domain data under a consistent experimental setup, and a comprehensive comparison of SSL methods from the literature is provided. Notably, DINO achieves the best downstream performance and effectively models intra-speaker variability, although it is highly sensitive to hyperparameters and training conditions, while SimCLR and MoCo provide robust alternatives that effectively capture inter-speaker variability and are less prone to collapse. This work aims to highlight recent trends and advancements and to identify current challenges in the field.
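As a minimal illustration of the instance-invariance principle shared by these frameworks (not any specific system from the reviewed literature), the sketch below implements the NT-Xent contrastive loss used by SimCLR in NumPy: embeddings of two augmented views of the same utterance are pulled together, while all other utterances in the batch act as negatives. Shapes, the temperature value, and the function name are illustrative assumptions.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss (the SimCLR contrastive objective), a sketch.

    z1[i] and z2[i] are embeddings of two augmented views of the same
    utterance; every other embedding in the batch is a negative.
    """
    z = np.concatenate([z1, z2], axis=0)               # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize
    sim = (z @ z.T) / temperature                      # scaled cosine sims
    n = len(z1)
    # Mask self-similarity so an embedding is never its own negative.
    np.fill_diagonal(sim, -np.inf)
    # The positive for index i is its other view at (i + N) mod 2N.
    pos = np.roll(np.arange(2 * n), n)
    # Cross-entropy of the positive against all candidates.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z1 = rng.normal(size=(8, 16))
    z2 = z1 + 0.01 * rng.normal(size=(8, 16))  # nearly identical views
    z3 = rng.normal(size=(8, 16))              # unrelated "views"
    # Aligned view pairs should incur a lower loss than random pairs.
    print(nt_xent_loss(z1, z2) < nt_xent_loss(z1, z3))
```

For speaker recognition, the two views are typically augmented segments drawn from the same recording (the positive-sampling component studied in the review); MoCo replaces the in-batch negatives with a momentum-encoded queue, and DINO drops explicit negatives in favor of a self-distillation objective.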