We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, acoustic and linguistic cues are available jointly, which naturally motivates integrating target-speaker conditioning with contextual biasing for overlapping conversations. CALM realizes this integration in an end-to-end framework through speaker embedding-driven target-speaker extraction and dynamic vocabulary-based contextual biasing. We evaluate CALM on simulated mixtures in English (LibriSpeechMix) and Japanese (CSJMix, derived from the Corpus of Spontaneous Japanese). On two-speaker mixtures, CALM reduces the biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix and the biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3), demonstrating the effectiveness of joint acoustic-linguistic modeling across languages. We additionally report results on the AMI corpus (IHM-mix condition) to validate performance on standardized speech mixtures.
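To make the two components named above concrete, the following is a minimal, hypothetical sketch of (1) target-speaker conditioning via a speaker embedding and (2) contextual biasing over a dynamic vocabulary of bias phrases. All module names, dimensions, and fusion choices (FiLM-style speaker conditioning, cross-attention over bias-phrase embeddings) are our assumptions for illustration, not the CALM implementation.

```python
# Hypothetical sketch of speaker-conditioned encoding plus dynamic-vocabulary
# biasing. Architecture details are assumptions, not the CALM implementation.
import torch
import torch.nn as nn


class SpeakerConditionedEncoder(nn.Module):
    """Fuses a target-speaker embedding into acoustic frames (assumed FiLM-style)."""

    def __init__(self, feat_dim: int = 80, hidden: int = 256, spk_dim: int = 192):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        # Map the speaker embedding to a per-channel scale and shift.
        self.film = nn.Linear(spk_dim, 2 * hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, feats: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, feat_dim); spk_emb: (B, spk_dim)
        h = self.proj(feats)
        scale, shift = self.film(spk_emb).chunk(2, dim=-1)
        h = h * scale.unsqueeze(1) + shift.unsqueeze(1)  # condition every frame
        out, _ = self.rnn(h)
        return out  # (B, T, hidden)


class DynamicVocabularyBias(nn.Module):
    """Cross-attention from encoder frames to embeddings of bias phrases."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)

    def forward(self, enc: torch.Tensor, phrase_emb: torch.Tensor) -> torch.Tensor:
        # enc: (B, T, hidden); phrase_emb: (B, N_phrases, hidden)
        biased, _ = self.attn(query=enc, key=phrase_emb, value=phrase_emb)
        return enc + biased  # residual fusion of contextual information


if __name__ == "__main__":
    encoder = SpeakerConditionedEncoder()
    bias = DynamicVocabularyBias()
    feats = torch.randn(2, 100, 80)    # two utterances, 100 frames of 80-dim features
    spk = torch.randn(2, 192)          # e.g. x-vector-sized speaker embeddings
    phrases = torch.randn(2, 10, 256)  # 10 bias-phrase embeddings per utterance
    out = bias(encoder(feats, spk), phrases)
    print(out.shape)  # torch.Size([2, 100, 256])
```

In this sketch the speaker embedding gates every encoder frame, while the bias phrases enter through a residual cross-attention layer, so recognition can be steered simultaneously by who is speaking and by what vocabulary is expected.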