Existing self-supervised learning (SSL) models for speech typically process speech signals at a fixed resolution of 20 milliseconds. This approach overlooks the fact that speech signals carry different information at different resolutions. In contrast, this paper incorporates multi-resolution information into self-supervised speech representation learning. We introduce an SSL model that combines a hierarchical Transformer architecture with HuBERT-style masked prediction objectives to process speech at multiple resolutions. Experimental results indicate that the proposed model not only achieves more efficient inference but also performs comparably to or better than the original HuBERT model across various tasks. Specifically, significant improvements over the original HuBERT are observed in fine-tuning experiments on the LibriSpeech speech recognition benchmark, as well as in evaluations on the Speech Universal PERformance Benchmark (SUPERB) and Multilingual SUPERB (ML-SUPERB).
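To make the notion of resolution concrete, the following minimal sketch counts how many frames an utterance yields at a given frame shift and how that affects the cost of self-attention. It is an illustration under stated assumptions, not the model proposed here: the 40 ms and 80 ms values are hypothetical coarser resolutions chosen for the example (the abstract only names 20 ms), the function names are made up, and the quadratic cost model is the generic property of standard Transformer self-attention. Real frame counts also depend on the padding and striding of the feature extractor.

```python
def num_frames(duration_ms: int, frame_shift_ms: int) -> int:
    """Approximate number of frames for an utterance at a given frame shift."""
    return duration_ms // frame_shift_ms


def relative_attention_cost(duration_ms: int, frame_shift_ms: int,
                            base_shift_ms: int = 20) -> float:
    """Self-attention cost relative to the 20 ms baseline.

    Assumes cost grows with the square of the sequence length, which holds
    for standard (non-sparse) self-attention.
    """
    base = num_frames(duration_ms, base_shift_ms)
    frames = num_frames(duration_ms, frame_shift_ms)
    return (frames / base) ** 2


if __name__ == "__main__":
    utterance_ms = 10_000  # a 10-second utterance
    # 20 ms is the fixed resolution named in the text; 40 and 80 ms are
    # hypothetical coarser resolutions used only for illustration.
    for shift in (20, 40, 80):
        print(shift, num_frames(utterance_ms, shift),
              relative_attention_cost(utterance_ms, shift))
```

Under these assumptions, halving the resolution halves the sequence length and cuts the self-attention cost of the affected layers to a quarter, which is one plausible source of the more efficient inference reported above.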