Hearing More with Less: Multi-Modal Retrieval-and-Selection Augmented Conversational LLM-Based ASR

Automatic Speech Recognition (ASR) aims to convert human speech into the corresponding text. In conversational scenarios, effectively exploiting context can improve recognition accuracy. The strong long-context understanding and reasoning abilities of Large Language Models (LLMs) enable LLM-based ASR (LLM-ASR) to leverage historical context when recognizing conversational speech, which is highly context-dependent. However, existing conversational LLM-ASR methods use either a fixed number of preceding utterances or the entire conversation history as context, which introduces large amounts of irrelevant and redundant information and thus causes significant recognition confusion and high computational cost. This paper proposes MARS, a multi-modal retrieval-and-selection method that augments conversational LLM-ASR by enabling it to retrieve and select the acoustic and textual historical context most relevant to the current utterance. Specifically, multi-modal retrieval obtains a set of candidate historical contexts, each exhibiting high acoustic or textual similarity to the current utterance. Multi-modal selection then computes the acoustic and textual similarities of each retrieved candidate and applies our proposed near-ideal ranking method, which considers both similarities jointly, to select the best historical context. Evaluations on the Interspeech 2025 Multilingual Conversational Speech Language Model Challenge dataset show that an LLM-ASR system trained on only 1.5K hours of data and equipped with MARS outperforms the state-of-the-art top-ranking system trained on 179K hours of data.
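
The two-stage idea described above can be illustrated with a minimal sketch. It is not the MARS implementation: the abstract does not specify the embeddings, the similarity measure, or the near-ideal ranking rule. The sketch assumes each utterance already has an acoustic and a textual embedding vector, uses cosine similarity for both modalities, and stands in for the ranking step with Euclidean distance to a hypothetical ideal point (1, 1). The names `cosine`, `retrieve`, `select`, and the parameter `k` are all illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, history, k):
    """Stage 1 (assumed form): keep every past utterance that is in the
    top-k by acoustic similarity OR the top-k by textual similarity.

    Returns a list of (index, acoustic_sim, textual_sim) tuples.
    """
    acoustic = [cosine(query["acoustic"], h["acoustic"]) for h in history]
    textual = [cosine(query["text"], h["text"]) for h in history]

    def top_k(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return order[:k]

    candidates = sorted(set(top_k(acoustic)) | set(top_k(textual)))
    return [(i, acoustic[i], textual[i]) for i in candidates]


def select(candidates):
    """Stage 2 (assumed stand-in for near-ideal ranking): pick the candidate
    whose (acoustic_sim, textual_sim) pair is closest to the ideal point (1, 1).
    """
    best = min(candidates, key=lambda c: math.hypot(1.0 - c[1], 1.0 - c[2]))
    return best[0]


# Toy conversation history with hand-made 2-D embeddings (illustrative only).
history = [
    {"acoustic": [1.0, 0.0], "text": [0.0, 1.0]},  # acoustically close, textually far
    {"acoustic": [0.0, 1.0], "text": [1.0, 0.0]},  # textually close, acoustically far
    {"acoustic": [0.8, 0.6], "text": [0.8, 0.6]},  # reasonably close in both
]
query = {"acoustic": [1.0, 0.0], "text": [1.0, 0.0]}

candidates = retrieve(query, history, k=2)
best_index = select(candidates)
print(best_index)
```

In this toy example the third utterance is selected because it is similar to the current utterance in both modalities, whereas the first two each score highly in only one. That is the behavior a ranking over both similarities is meant to favor, under the assumptions stated above.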
