This paper introduces Robust Spin (R-Spin), a data-efficient self-supervised fine-tuning framework that produces speaker- and noise-invariant speech representations by learning discrete acoustic units with speaker-invariant clustering (Spin). R-Spin resolves Spin's issues and enhances content representations by learning to predict acoustic pieces. R-Spin offers a 12x reduction in computational resources compared to previous state-of-the-art methods while outperforming them in severely distorted speech scenarios. This paper provides detailed analyses showing how discrete units contribute to speech encoder training and improve robustness in diverse acoustic environments.