General-purpose audio representations learned by self-supervised learning have demonstrated high performance in a variety of tasks. Although they can be adapted to an application by fine-tuning, even higher performance can be expected if the pre-training itself is specialized for that application. This paper explores the challenges and solutions in specializing general-purpose audio representations for a specific application, using speech, a highly demanding field, as an example. We enhance Masked Modeling Duo (M2D), a general-purpose model, to close the performance gap with state-of-the-art (SOTA) speech models. To do so, we propose a new task, denoising distillation, which learns from fine-grained clustered features, and M2D for Speech (M2D-S), which jointly learns the denoising distillation task and the M2D masked prediction task. Experimental results show that M2D-S performs comparably to or outperforms SOTA speech models on the SUPERB benchmark, demonstrating that M2D can specialize in a demanding field. Our code is available at: https://github.com/nttcslab/m2d/tree/master/speech