Benefiting from massive and diverse data sources, speech foundation models generalize well and transfer knowledge to a wide range of downstream tasks. However, they are designed for single-speaker input, which makes them ineffective at recognizing multi-speaker overlapped speech, a common occurrence in real-world scenarios. In this study, we investigate how to adapt speech foundation models to suppress interfering speakers in overlapped speech and perform target-speaker automatic speech recognition (TS-ASR). First, we use the Whisper model as the foundation for adaptation and conduct a thorough comparison of its integration with existing target-speaker adaptation techniques. We then propose a new model, Speaker-Querying Whisper (SQ-Whisper), which uses a fixed number of trainable queries to capture speaker prompts from overlapped speech, conditioned on the target-speaker enrollment. These prompts steer the model to extract speaker-specific features and accurately recognize the target speaker's transcription. Experimental results demonstrate that our approach effectively adapts the pre-trained speech foundation model to TS-ASR. Compared with the strong TS-HuBERT baseline, the proposed SQ-Whisper yields relative word error rate (WER) reductions of up to 15% on Libri2Mix and 10% on WSJ0-2Mix. With data augmentation, we establish new state-of-the-art WERs of 14.6% on the Libri2Mix Test set and 4.4% on the WSJ0-2Mix Test set. Furthermore, we evaluate our model on the real-world AMI meeting dataset, where it shows consistent improvements over other adaptation methods.
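For readers unfamiliar with the metric, the following minimal sketch illustrates how a WER and a relative WER reduction are typically computed. It assumes the standard word-level edit-distance definition of WER; the function names and the numbers in the usage note are hypothetical and are not taken from our experiments or evaluation scripts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of a new system with respect to a baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

As a purely illustrative usage, `relative_reduction(20.0, 17.0)` gives a 15% relative reduction: a hypothetical baseline at 20.0% WER improved to 17.0% WER. A relative reduction is therefore not the same as an absolute drop of 15 WER points.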