Automatic speech recognition (ASR) has benefited from advances in pretrained speech and language models, yet most systems remain constrained to monolingual settings and short, isolated utterances. While recent efforts in context-aware ASR show promise, two key challenges persist: limited multilingual support and the absence of principled alignment between speech and contextual representations. In this paper, we introduce a context-aware multilingual ASR framework that supports diverse languages and accents while preserving the modularity of pretrained models. Our approach combines a frozen speech encoder and a decoder-only language model via a lightweight projection module, allowing structured context prompts, including dialogue history and biasing words, to guide transcription. To improve interaction between speech and context, we employ a contrastive learning objective that aligns their representations in a shared embedding space. Evaluations on over 1,500 hours of real-world conversational speech across 11 languages and 5 English dialects show that contextual input consistently improves recognition quality. Contrastive alignment yields additional improvements across context types, raising overall performance by more than 5%. These results highlight the importance of both contextual modeling and cross-modal alignment in multilingual ASR.
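The contrastive objective described above, which pulls paired speech and context embeddings together in a shared space, can be sketched as a symmetric InfoNCE loss. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name, the temperature value, and the use of mean-pooled, L2-normalized embeddings are all hypothetical choices made for the sketch.

```python
import numpy as np

def info_nce(speech_emb: np.ndarray, context_emb: np.ndarray, tau: float = 0.07) -> float:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    speech_emb, context_emb: (batch, dim) arrays; row i of each matrix
    forms a positive pair, while all other rows serve as in-batch negatives.
    """
    # L2-normalize so dot products become cosine similarities.
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    c = context_emb / np.linalg.norm(context_emb, axis=1, keepdims=True)
    logits = s @ c.T / tau            # (batch, batch) similarity matrix
    n = len(s)                        # positives lie on the diagonal

    def xent(l: np.ndarray) -> float:
        # Cross-entropy of a row-wise softmax against the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)   # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-logp[np.arange(n), np.arange(n)].mean())

    # Average the speech-to-context and context-to-speech directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned pairs the diagonal dominates and the loss approaches zero; mismatched pairs (e.g. shuffled context rows) drive it up, which is what makes the objective suitable for pulling the two modalities into a shared embedding space.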