While Speech Foundation Models (SFMs) excel at a wide range of speech tasks, their performance on low-resource tasks such as child Automatic Speech Recognition (ASR) is hampered by limited pretraining data. To address this, we explore different model merging techniques to leverage knowledge from models trained on larger, more diverse speech corpora. We also introduce Selective Attention (SA) Merge, a novel method that selectively merges task vectors from attention matrices to improve SFM performance on low-resource tasks. Experiments on the MyST database show significant relative reductions in word error rate (WER) of up to 14%, outperforming existing model merging and data augmentation techniques. By combining data augmentation with SA Merge, we achieve a new state-of-the-art WER of 8.69 on the MyST database for the Whisper-small model, highlighting the potential of SA Merge for improving low-resource ASR.
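The core idea of selective task-vector merging can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact procedure: the parameter names, the `"attn"` name filter, the scaling factor `alpha`, and the toy weights are all hypothetical, and the paper's actual selection criterion for attention matrices may differ.

```python
import numpy as np

def task_vector(base, finetuned):
    """Task vector = fine-tuned weights minus base weights, per parameter."""
    return {name: finetuned[name] - base[name] for name in base}

def selective_merge(base, finetuned, alpha=0.5,
                    selector=lambda name: "attn" in name):
    """Add the scaled task vector back to the base model, but only for
    parameters picked out by `selector` (here: attention matrices).
    `alpha` and the name-based selector are illustrative assumptions."""
    tv = task_vector(base, finetuned)
    return {name: base[name] + alpha * tv[name] if selector(name)
            else base[name]
            for name in base}

# Toy example: only the attention parameter is merged; the MLP
# parameter is left at its base value.
base = {"attn.q_proj": np.zeros(2), "mlp.fc1": np.zeros(2)}
ft   = {"attn.q_proj": np.ones(2),  "mlp.fc1": np.ones(2)}
merged = selective_merge(base, ft, alpha=0.5)
```

In this sketch, `merged["attn.q_proj"]` moves halfway toward the fine-tuned weights while `merged["mlp.fc1"]` stays at the base value, illustrating how merging can be confined to a chosen subset of matrices.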