The strong generalization ability of self-supervised learning (SSL) for speech foundation models has attracted significant attention. HuBERT is a successful example: it uses offline clustering to convert speech features into discrete units, which serve as targets for a masked language modeling pretext task. However, simply using k-means clusters of features as targets does not fully exploit the model's potential. In this work, we present an unsupervised method for improving SSL targets. We propose two models, MonoBERT and PolyBERT, which use context-independent and context-dependent phoneme-based units, respectively, for pre-training. Our models significantly outperform other SSL models on the LibriSpeech benchmark without requiring iterative re-clustering and re-training. Furthermore, our models equipped with context-dependent units even outperform target-improvement models that use labeled data during pre-training. Experiments demonstrate how we progressively improve the unit discovery process.
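To make the baseline concrete, the following is a minimal sketch of HuBERT-style target construction: frame-level features are clustered offline with k-means, each frame is assigned a discrete unit, and the units at masked positions become prediction targets. It is not the authors' pipeline and does not implement the MonoBERT or PolyBERT unit discovery. The function names `kmeans_units` and `span_mask`, the random stand-in features, and all parameter values (number of units, mask probability, span length) are assumptions chosen for illustration.

```python
import numpy as np


def kmeans_units(features, num_units, num_iters=20, seed=0):
    """Cluster frame-level features; return (unit id per frame, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen distinct frames.
    init = rng.choice(len(features), size=num_units, replace=False)
    centroids = features[init].copy()
    for _ in range(num_iters):
        # Squared Euclidean distance from every frame to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        units = dists.argmin(axis=1)
        for k in range(num_units):
            members = features[units == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    # Final assignment, consistent with the returned centroids.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1), centroids


def span_mask(num_frames, mask_prob=0.08, span=10, seed=0):
    """Pick random start frames and mask a fixed-length span after each."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(num_frames, dtype=bool)
    starts = np.flatnonzero(rng.random(num_frames) < mask_prob)
    for s in starts:
        mask[s:s + span] = True  # slicing clips spans at the sequence end
    return mask


# Random vectors stand in for real speech features (assumption: 39-dim frames).
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 39))

units, centroids = kmeans_units(features, num_units=8)
mask = span_mask(len(features))

# The model would see masked inputs and predict the units at masked frames.
targets = units[mask]
```

In this sketch the targets depend only on acoustic similarity between frames, which is the limitation the abstract points to when it argues for phoneme-based units instead.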