We revisit a self-supervised method that segments unlabelled speech into word-like segments. We start from the two-stage duration-penalised dynamic programming method that performs zero-resource segmentation without learning an explicit lexicon. In the first stage, acoustic unit discovery, we replace contrastive predictive coding features with HuBERT features. After word segmentation in the second stage, we obtain an acoustic word embedding for each segment by averaging its HuBERT features. These embeddings are then clustered with K-means to build a lexicon. The result is a full-coverage segmentation with a lexicon that achieves state-of-the-art performance on the ZeroSpeech benchmarks.
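As a rough illustration of the lexicon-building step described above (averaging frame-level features within each segment, then clustering the resulting embeddings with K-means), a minimal sketch might look as follows. It is not the authors' implementation: the function names `mean_pool_segments` and `kmeans`, the boundary convention, and the toy array standing in for HuBERT features are all assumptions made for this example, and the K-means is a plain Lloyd's iteration written in NumPy.

```python
import numpy as np


def mean_pool_segments(features, boundaries):
    """Average frame-level features within each segment.

    features: (T, D) array of frame features (a stand-in for HuBERT outputs).
    boundaries: increasing frame indices ending at T; segment i covers
        frames [start_i, boundaries[i]), with start_0 = 0 (assumed convention).
    """
    starts = [0] + list(boundaries[:-1])
    return np.stack(
        [features[s:e].mean(axis=0) for s, e in zip(starts, boundaries)]
    )


def nearest_centroid(embeddings, centroids):
    """Index of the closest centroid (squared Euclidean) for each embedding."""
    diffs = embeddings[:, None, :] - centroids[None, :, :]
    return (diffs ** 2).sum(axis=-1).argmin(axis=1)


def kmeans(embeddings, k, n_iter=50, seed=0):
    """Plain Lloyd's K-means; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct embeddings.
    init = rng.choice(len(embeddings), size=k, replace=False)
    centroids = embeddings[init].copy()
    for _ in range(n_iter):
        assignments = nearest_centroid(embeddings, centroids)
        for j in range(k):
            members = embeddings[assignments == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, nearest_centroid(embeddings, centroids)


# Toy data (hypothetical): 12 frames, 2 dimensions, four 3-frame segments
# alternating between two well-separated "word types".
toy_features = np.zeros((12, 2))
toy_features[3:6] = 10.0
toy_features[9:12] = 10.0
toy_features += 0.01 * np.arange(12)[:, None]
toy_boundaries = [3, 6, 9, 12]

toy_embeddings = mean_pool_segments(toy_features, toy_boundaries)
toy_centroids, toy_assignments = kmeans(toy_embeddings, k=2)
print(toy_assignments)
```

Each cluster index plays the role of a lexicon entry, so segments assigned to the same cluster are treated as instances of the same word type. In practice the embeddings would come from real HuBERT features, and the number of clusters would be far larger than in this toy case.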