We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, acoustic and linguistic cues are available together, which naturally motivates integrating target-speaker conditioning with contextual biasing for overlapping conversations. CALM implements this integration in an end-to-end framework through speaker embedding-driven target-speaker extraction and dynamic vocabulary-based contextual biasing. We evaluate CALM on simulated English mixtures (LibriSpeechMix) and on simulated Japanese mixtures built from the Corpus of Spontaneous Japanese (CSJMix). On two-speaker mixtures, CALM reduces biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix and biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3), demonstrating the effectiveness of joint acoustic-linguistic modeling across languages. We additionally report results on the AMI corpus (IHM-mix condition) to validate performance on standardized speech mixtures.