Pre-trained speech encoders have been central to pushing state-of-the-art results across various speech understanding and generation tasks. Nonetheless, the capabilities of these encoders in low-resource settings have yet to be thoroughly explored. To address this, we conduct a comprehensive set of experiments with 3 representative state-of-the-art encoders (Wav2vec2, WavLM, Whisper) in the low-resource setting across 7 speech understanding and generation tasks. We provide quantitative and qualitative analyses of task performance, convergence speed, and the representational properties of the encoders. We observe a connection between the pre-training protocols of these encoders and the way in which they capture information in their internal layers. In particular, we observe that the Whisper encoder exhibits the greatest low-resource capabilities on content-driven tasks in terms of performance and convergence speed.