While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often transcribe in the wrong output language, compromising transcription fidelity and downstream application quality. To preserve flexibility and code-switching capabilities, we propose a soft prompting approach that hints at the languages likely to be spoken without strictly constraining the output. We formally define this challenge as a lack of language adherence, introduce a novel metric to quantify violations, and evaluate three mitigation strategies: (1) zero-shot prompting for robust guidance under uncertainty, (2) supervised fine-tuning (SFT) to improve prompt adherence, and (3) Chain-of-Thought (CoT) reasoning to enforce adherence during decoding. We present a comparative analysis of these methods across multiple languages, evaluating how effectively each reduces language violations while maintaining overall ASR performance. Finally, we discuss the trade-offs to guide strategy selection under different compute constraints.