In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective. We demonstrate that a nearly identical model architecture (HuBERT) trained with a masked language modeling loss does not exhibit this same ability, suggesting that the visual grounding objective is responsible for the emergence of this phenomenon. We propose the use of a minimum cut algorithm to automatically predict syllable boundaries in speech, followed by a two-stage clustering method to group identical syllables together. We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian. Finally, we show that the same model is capable of zero-shot generalization for a word segmentation task on four other languages from the ZeroSpeech Challenge, in some cases beating the previous state-of-the-art.
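The abstract does not spell out how the minimum cut step operates, so the following is only a minimal sketch of one way a min-cut-style segmentation could look, not the authors' implementation. It assumes that frame-level features are available as a NumPy array, that similarity is a clipped dot product, that the number of segments is known in advance, and that the objective is the sum of normalized cuts over contiguous segments; the function name `min_cut_segment` and all of these choices are hypothetical.

```python
import numpy as np


def min_cut_segment(features, num_segments):
    """Split frames into contiguous segments with minimal total normalized cut.

    Hypothetical sketch: returns the start indices of segments 2..K,
    i.e. the interior boundaries. Requires num_segments <= number of frames.
    """
    feats = np.asarray(features, dtype=float)
    T = feats.shape[0]

    # Assumed similarity: non-negative frame-by-frame dot product.
    sim = np.clip(feats @ feats.T, 0.0, None)

    # 2-D prefix sums so that integral[i, j] == sim[:i, :j].sum().
    integral = np.zeros((T + 1, T + 1))
    integral[1:, 1:] = sim.cumsum(axis=0).cumsum(axis=1)

    def ncut(s, t):
        # Normalized cut of the segment covering frames s .. t-1.
        within = integral[t, t] - integral[s, t] - integral[t, s] + integral[s, s]
        volume = integral[t, T] - integral[s, T]
        if volume <= 0:
            return 0.0  # degenerate segment with no similarity mass
        return (volume - within) / volume

    # best[k, t]: minimal cost of splitting frames 0 .. t-1 into k segments.
    best = np.full((num_segments + 1, T + 1), np.inf)
    back = np.zeros((num_segments + 1, T + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, num_segments + 1):
        for t in range(k, T + 1):
            for s in range(k - 1, t):
                cost = best[k - 1, s] + ncut(s, t)
                if cost < best[k, t]:
                    best[k, t] = cost
                    back[k, t] = s

    # Backtrack from the last frame to recover each segment's start index.
    starts = []
    t = T
    for k in range(num_segments, 0, -1):
        t = int(back[k, t])
        starts.append(t)
    return starts[::-1][1:]  # drop the leading 0, keep interior boundaries
```

As a sanity check on synthetic input, three blocks of mutually orthogonal one-hot frames with lengths 4, 5, and 3, built with `np.repeat(np.eye(3), [4, 5, 3], axis=0)`, are split at frames 4 and 9. Real speech features would need a tuned similarity measure and some way to choose the number of segments, neither of which is specified here.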