Speech foundation models have achieved state-of-the-art (SoTA) performance across various tasks, such as automatic speech recognition (ASR) in hundreds of languages. However, multi-speaker ASR remains a challenging task for these models due to data scarcity and sparsity. In this paper, we present approaches to enable speech foundation models to process and understand multi-speaker speech with limited training data. Specifically, we adapt a speech foundation model for the multi-speaker ASR task using only telephonic data. Remarkably, the adapted model also performs well on meeting data without any fine-tuning, demonstrating the generalization ability of our approach. We conduct several ablation studies to analyze the impact of different parameters and strategies on model performance. Our findings highlight the effectiveness of our methods. Results show that adapting fewer parameters yields a better overall concatenated minimum-permutation word error rate (cpWER), a counter-intuitive finding that provides insight into adapting speech foundation models for multi-speaker ASR tasks with minimal annotated data.