In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR), which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and yield a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and we provide a detailed analysis of the model and the learned discrete units. The source code will be made available after the anonymity period.
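The three steps named above (teacher embeddings, online clustering, discretized targets for the student) can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not specify how clusters are assigned or updated, so the nearest-codeword assignment, the exponential-moving-average (EMA) updates for the codebook and the teacher, the cross-entropy loss on masked frames, and all function names and decay values below are assumptions chosen for illustration.

```python
import math


def assign_codes(embeddings, codebook):
    """Return, for each frame embedding, the index of the nearest codeword
    (squared Euclidean distance). Assumed assignment rule."""
    codes = []
    for e in embeddings:
        dists = [sum((x - c) ** 2 for x, c in zip(e, code)) for code in codebook]
        codes.append(dists.index(min(dists)))
    return codes


def ema_update_codebook(sums, counts, embeddings, codes, decay=0.9):
    """Online clustering step (assumed EMA form).

    `sums` and `counts` are running statistics per codeword and are updated
    in place; codewords with no assigned frame in this batch are left alone.
    Returns the refreshed codebook, sums[k] / counts[k].
    """
    dim = len(sums[0])
    batch_sums = [[0.0] * dim for _ in sums]
    batch_counts = [0] * len(sums)
    for e, k in zip(embeddings, codes):
        batch_counts[k] += 1
        for j in range(dim):
            batch_sums[k][j] += e[j]
    for k in range(len(sums)):
        if batch_counts[k] == 0:
            continue
        sums[k] = [decay * s + (1 - decay) * b
                   for s, b in zip(sums[k], batch_sums[k])]
        counts[k] = decay * counts[k] + (1 - decay) * batch_counts[k]
    return [[s / counts[k] for s in sums[k]] for k in range(len(sums))]


def ema_update_teacher(teacher, student, decay=0.999):
    """Self-distillation step (assumed): teacher weights track the student."""
    return [decay * t + (1 - decay) * s for t, s in zip(teacher, student)]


def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy between student logits and teacher-derived codes,
    computed only over frames where mask is True."""
    losses = []
    for row, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(log_z - row[target])
    return sum(losses) / len(losses) if losses else 0.0


# Toy usage: three 2-D "teacher embeddings" and a two-entry codebook.
teacher_embeddings = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0]]
codebook = [[0.0, 0.0], [1.0, 1.0]]
codes = assign_codes(teacher_embeddings, codebook)
sums = [list(c) for c in codebook]  # running sums start at the codewords
counts = [1.0, 1.0]                 # running counts start at one
codebook = ema_update_codebook(sums, counts, teacher_embeddings, codes)
student_logits = [[0.0, 0.0]] * 3   # an untrained student: uniform prediction
loss = masked_cross_entropy(student_logits, codes, [True, False, True])
```

In this sketch the codebook is never trained by gradient descent; it only follows the teacher's embeddings, and the student is the only component that receives a loss. Whether DinoSR makes the same choices is not stated in the abstract.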