Existing Self-Supervised Learning (SSL) models for speech typically process speech signals at a fixed resolution of 20 milliseconds. This approach overlooks the fact that speech signals carry different information at different resolutions. In contrast, this paper aims to incorporate multi-resolution information into speech self-supervised representation learning. We introduce an SSL model that leverages a hierarchical Transformer architecture, complemented by HuBERT-style masked prediction objectives, to process speech at multiple resolutions. Experimental results indicate that the proposed model not only achieves more efficient inference but also matches or outperforms the original HuBERT model across various tasks. Specifically, significant performance improvements over the original HuBERT are observed in fine-tuning experiments on the LibriSpeech speech recognition benchmark as well as in evaluations on the Speech Universal PERformance Benchmark (SUPERB) and Multilingual SUPERB (ML-SUPERB).
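To make the notion of resolution concrete, the following is a minimal sketch of how the frame count changes with the frame shift, and of one simple way to derive a coarser sequence from a finer one. The 40 ms resolution, the average-pooling step, and the helper names `num_frames` and `downsample` are illustrative assumptions; the abstract does not specify which resolutions or downsampling mechanism the proposed model uses.

```python
def num_frames(duration_ms: int, frame_shift_ms: int) -> int:
    """Number of whole frames in an utterance at one frame per frame_shift_ms."""
    return duration_ms // frame_shift_ms


def downsample(frames, factor):
    """Average-pool consecutive groups of `factor` frames.

    `frames` is a list of equal-length feature vectors (lists of floats).
    Trailing frames that do not fill a whole group are dropped.
    This pooling is an assumption for illustration, not the paper's method.
    """
    usable = len(frames) - len(frames) % factor
    return [
        [sum(column) / factor for column in zip(*frames[i:i + factor])]
        for i in range(0, usable, factor)
    ]


# One second of speech at the usual 20 ms resolution vs. a hypothetical 40 ms one.
fine_count = num_frames(1000, 20)
coarse_count = num_frames(1000, 40)

# Toy 2-dimensional features: five frames at the fine resolution.
fine = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [6.0, 8.0], [1.0, 1.0]]
coarse = downsample(fine, 2)
```

Halving the resolution halves the sequence length, which matters for efficiency because the cost of standard self-attention grows quadratically with the number of frames.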