Self-supervised learning (SSL) models have demonstrated exceptional performance on a range of speech tasks, particularly in low-resource and multilingual settings. Recent work shows that fusing several SSL models can achieve better performance than a single SSL model. However, fusion increases the parameter count and, with it, the inference time. In this paper, we propose predicting the features of other SSL models from a single SSL model, which yields a lightweight framework with competitive performance. Our experiments show that SSL feature prediction models outperform individual SSL models on multilingual speech recognition tasks. The best prediction model improves the average SUPERB score by 135.4 on the ML-SUPERB benchmark. Moreover, the proposed framework is more efficient than previous fusion models, with fewer parameters and shorter inference times.
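The core idea of predicting a second SSL model's features from a single model's features can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the linear prediction head, the squared-error loss, the feature dimensions (4 and 3), the synthetic data, and the helper names `predict`, `train`, and `fuse`. The abstract does not specify the actual architecture or training objective.

```python
import random


def predict(weights, bias, frame):
    """Hypothetical linear head: map one source-model frame to a target-model frame."""
    return [sum(w * x for w, x in zip(row, frame)) + b
            for row, b in zip(weights, bias)]


def mse(weights, bias, sources, targets):
    """Mean squared error between predicted and true target features."""
    total, count = 0.0, 0
    for x, y in zip(sources, targets):
        for p, t in zip(predict(weights, bias, x), y):
            total += (p - t) ** 2
            count += 1
    return total / count


def train(sources, targets, epochs=200, lr=0.05):
    """Fit the head with plain stochastic gradient descent on squared error."""
    d_in, d_out = len(sources[0]), len(targets[0])
    weights = [[0.0] * d_in for _ in range(d_out)]
    bias = [0.0] * d_out
    for _ in range(epochs):
        for x, y in zip(sources, targets):
            pred = predict(weights, bias, x)
            for j in range(d_out):
                err = pred[j] - y[j]
                for i in range(d_in):
                    weights[j][i] -= lr * err * x[i]
                bias[j] -= lr * err
    return weights, bias


def fuse(weights, bias, frame):
    """Concatenate real source features with predicted target features.

    Only the source model has to run at inference time; the second model's
    features are approximated by the small prediction head.
    """
    return list(frame) + predict(weights, bias, frame)


# Synthetic stand-ins for per-frame features of two SSL models (assumed data).
rng = random.Random(0)
true_map = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
sources = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(50)]
targets = [[sum(w * x for w, x in zip(row, frame)) for row in true_map]
           for frame in sources]

# Compare the untrained (all-zero) head with the trained one.
loss_before = mse([[0.0] * 4 for _ in range(3)], [0.0] * 3, sources, targets)
weights, bias = train(sources, targets)
loss_after = mse(weights, bias, sources, targets)
fused = fuse(weights, bias, sources[0])
```

In this sketch the fused vector has the combined dimensionality of both feature sets, while the only added cost over the single source model is the small prediction head. That is the efficiency argument the abstract makes, reduced to its simplest form.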