Recent advancements in Self-Supervised Learning (SSL) have shown promising results in Speaker Verification (SV). However, narrowing the performance gap with supervised systems remains an ongoing challenge. Several studies have observed that speech representations from large-scale ASR models contain valuable speaker information. This work explores the limitations of fine-tuning these models for SV with an SSL contrastive objective in an end-to-end approach. We then propose a framework to learn speaker representations in an SSL context by fine-tuning a pre-trained WavLM with a supervised loss using pseudo-labels. Initial pseudo-labels are derived from an SSL DINO-based model and are iteratively refined by clustering the model's embeddings. Our method achieves 0.99% EER on VoxCeleb1-O, establishing a new state of the art on self-supervised SV. As this performance is close to our supervised baseline of 0.94% EER, this contribution is a step towards reaching supervised performance on SV with SSL.