General-purpose audio representations learned through self-supervised learning have demonstrated high performance on a variety of tasks. Although they can be optimized for an application by fine-tuning, even higher performance can be expected if pre-training itself is specialized for that application. This paper explores the challenges and solutions in specializing general-purpose audio representations for a specific application, using speech, a highly demanding field, as an example. We enhance Masked Modeling Duo (M2D), a general-purpose model, to close the performance gap with state-of-the-art (SOTA) speech models. To do so, we propose two things: a new task, denoising distillation, which learns from fine-grained clustered features; and M2D for Speech (M2D-S), which jointly learns the denoising distillation task and the M2D masked prediction task. Experimental results show that M2D-S performs comparably to or outperforms SOTA speech models on the SUPERB benchmark, demonstrating that M2D can specialize in a demanding field. Our code is available at: https://github.com/nttcslab/m2d/tree/master/speech
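As a rough illustration of what jointly learning the two tasks could look like, the sketch below combines a masked prediction loss and a denoising distillation loss into one weighted objective. The abstract does not specify the loss form, so everything here is an assumption made for illustration: the function names, the choice of mean squared error over L2-normalized vectors, and the weights `w_masked` and `w_distill` are hypothetical and are not taken from the M2D-S code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def normalized_mse(pred, target):
    """Mean squared error between L2-normalized vectors (assumed loss form)."""
    p, t = l2_normalize(pred), l2_normalize(target)
    return sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)


def joint_loss(masked_pred, masked_target, distill_pred, teacher_feat,
               w_masked=1.0, w_distill=1.0):
    """Weighted sum of two task losses (hypothetical weights).

    masked_pred / masked_target: prediction and target for masked patches.
    distill_pred / teacher_feat: prediction from a noisy input and the
    clustered feature it should match.
    """
    loss_masked = normalized_mse(masked_pred, masked_target)
    loss_distill = normalized_mse(distill_pred, teacher_feat)
    return w_masked * loss_masked + w_distill * loss_distill
```

In this sketch, setting `w_distill=0.0` reduces the objective to the masked prediction term alone, which is one simple way to think about how a specialization task can be layered on top of a general-purpose one.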