SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR

Benefiting from massive and diverse training data, speech foundation models generalize well and transfer knowledge to a wide range of downstream tasks. However, they handle only single-speaker speech input, which makes them ineffective at recognizing multi-speaker overlapped speech, a common occurrence in real-world scenarios. In this study, we investigate how to adapt speech foundation models to eliminate interfering speakers from overlapped speech and perform target-speaker automatic speech recognition (TS-ASR). We first take the Whisper model as the foundation for adaptation and thoroughly compare its integration with existing target-speaker adaptation techniques. We then propose Speaker-Querying Whisper (SQ-Whisper), which uses a fixed number of trainable queries to capture speaker prompts from overlapped speech based on the target-speaker enrollment. These prompts steer the model to extract speaker-specific features and accurately recognize the target speaker's transcription. Experimental results demonstrate that our approach effectively adapts the pre-trained speech foundation model to TS-ASR. Compared with the robust TS-HuBERT model, the proposed SQ-Whisper yields up to 15% and 10% relative reductions in word error rate (WER) on the Libri2Mix and WSJ0-2Mix datasets, respectively. With data augmentation, we establish new state-of-the-art WERs of 14.6% on the Libri2Mix Test set and 4.4% on the WSJ0-2Mix Test set. Furthermore, we evaluate our model on the real-world AMI meeting dataset, where it consistently improves over other adaptation methods.
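The results above are reported as word error rates and as relative reductions between two systems. As a minimal sketch of how these two quantities are commonly computed (this is not the authors' evaluation code), the Python below derives WER from a word-level edit distance and then the relative reduction between a baseline and an improved system. The function names and all example values are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts, chosen only to exercise the functions.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {wer:.3f}")

    # Hypothetical WERs for illustration; not results from the paper.
    print(f"Relative reduction: {relative_reduction(0.20, 0.17):.2%}")
```

Note that a relative reduction is measured against the baseline's WER, so the same relative figure corresponds to different absolute gains depending on how high the baseline error is.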

