Automatic speech recognition (ASR) still covers only a small fraction of the world's languages, mainly due to the scarcity of supervised data. In-context learning (ICL) with large language models (LLMs) offers a way to address this problem, but prior work focuses largely on high-resource languages covered during training and on text-only settings. This paper investigates whether speech LLMs can learn unseen languages through multimodal ICL (MICL), and how this capability can be used to improve ASR. We conduct experiments with two speech LLMs, Phi-4 and Qwen3-Omni, on three diverse endangered languages. First, we find that MICL is effective for unseen languages, leveraging both the speech and text modalities. We further show that cross-lingual transfer learning improves MICL efficiency on target languages without any training on them. Moreover, we analyze attention patterns to interpret the mechanisms behind MICL and observe layer-dependent preferences for audio versus text context, with an overall bias towards text. Finally, we show that prompt-based ASR with speech LLMs performs poorly on unseen languages, motivating a simple ASR system that combines a stronger acoustic model with a speech LLM via MICL-based selection among acoustic hypotheses. Results show that MICL consistently improves ASR performance, and that cross-lingual transfer learning matches or outperforms corpus-trained language models without using any target-language data. Our code is publicly available.
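To make the selection step concrete, the following is a minimal sketch of MICL-based hypothesis selection, not the paper's exact implementation. It assumes a hypothetical `speech_llm.score(context, continuation)` interface returning a log-likelihood, and an N-best list produced separately by the stronger acoustic model; the real model APIs (Phi-4, Qwen3-Omni) and prompt format may differ.

```python
# Hypothetical sketch: select among an acoustic model's N-best hypotheses
# using a speech LLM conditioned on a multimodal in-context prompt.
# `speech_llm.score(...)` is an assumed interface, not a real library call.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    audio: bytes      # few-shot audio clip in the unseen target language
    transcript: str   # its reference transcript


def build_micl_context(examples: List[Example], test_audio: bytes) -> list:
    """Interleave (audio, transcript) pairs, then append the test audio.

    The speech LLM sees both modalities in context, which is what lets it
    pick up an unseen language at inference time without any training."""
    context = []
    for ex in examples:
        context.append(("audio", ex.audio))
        context.append(("text", ex.transcript))
    context.append(("audio", test_audio))
    return context


def select_hypothesis(speech_llm, examples: List[Example],
                      test_audio: bytes, nbest: List[str]) -> str:
    """Rescore the acoustic model's N-best list with the speech LLM and
    return the hypothesis it deems most likely given the MICL context."""
    context = build_micl_context(examples, test_audio)
    scored: List[Tuple[float, str]] = [
        (speech_llm.score(context, hyp), hyp) for hyp in nbest
    ]
    return max(scored)[1]
```

Here `nbest` would come from beam search (or similar decoding) with the stronger acoustic model; the abstract does not specify the decoding setup, so that part is left abstract.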