Data-driven unit discovery in self-supervised learning (SSL) of speech has ushered in a new era of spoken language processing. Yet the discovered units often remain in phonetic space, and units beyond phonemes are largely underexplored. Here, we demonstrate that a syllabic organization emerges when learning sentence-level representations of speech. In particular, we adopt a "self-distillation" objective to fine-tune the pretrained HuBERT with an aggregator token that summarizes the entire sentence. Without any supervision, the resulting model draws definite boundaries in speech, and the representations across frames exhibit salient syllabic structures. We show that this emergent structure largely corresponds to ground-truth syllables. Furthermore, we propose a new benchmark task, Spoken Speech ABX, for evaluating sentence-level representations of speech. Compared to previous models, our model performs better in both unsupervised syllable discovery and sentence-level representation learning. Together, these results show that self-distillation of HuBERT gives rise to syllabic organization without relying on external labels or modalities, and potentially provides novel data-driven units for spoken language modeling.
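To make the ABX idea concrete, the following is a minimal sketch of how an ABX-style discrimination score could be computed over sentence-level embeddings. The abstract does not specify the protocol of Spoken Speech ABX, so everything here is an assumption for illustration: the generic ABX setup (X belongs to the same category as A, and a trial counts as correct when X is closer to A than to B), the use of cosine distance, the function names `cosine_distance` and `abx_accuracy`, and the toy two-dimensional vectors standing in for aggregator-token embeddings.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def abx_accuracy(triples):
    """Fraction of (A, B, X) triples in which X is closer to A than to B.

    By convention in this sketch, X shares its category with A, so a
    trial is correct when d(A, X) < d(B, X).
    """
    correct = 0
    for a, b, x in triples:
        if cosine_distance(a, x) < cosine_distance(b, x):
            correct += 1
    return correct / len(triples)


# Hypothetical toy embeddings standing in for sentence-level representations.
triples = [
    ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1]),  # X lies nearer A: counted correct
    ([1.0, 0.0], [0.0, 1.0], [0.2, 0.8]),  # X lies nearer B: counted as error
]
score = abx_accuracy(triples)
print(f"ABX accuracy: {score:.2f}")
```

In an actual evaluation the vectors would come from the model's sentence-level output rather than being written by hand, and how A, B, and X are chosen would follow the benchmark's own definition.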