In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually grounded training objective. We demonstrate that a nearly identical model architecture (HuBERT) trained with a masked language modeling loss does not exhibit this same ability, suggesting that the visual grounding objective is responsible for the emergence of this phenomenon. We propose the use of a minimum cut algorithm to automatically predict syllable boundaries in speech, followed by a two-stage clustering method to group identical syllables together. We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian. Finally, we show that the same model is capable of zero-shot generalization for a word segmentation task on four other languages from the Zerospeech Challenge, in some cases beating the previous state of the art.