Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities, including Auditory Speech Recognition (ASR), Visual Speech Recognition (VSR), and Audio-Visual Speech Recognition (AVSR). Despite this progress, current LLM-based approaches typically address each task independently, training a separate model per task, which inflates computational and deployment costs and forgoes potential cross-task synergies. They also rely on fixed-rate token compression, which restricts the flexibility to trade accuracy for efficiency at inference time. These limitations highlight the need for a unified framework that supports ASR, VSR, and AVSR while enabling elastic inference. To this end, we present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Specifically, we adapt the matryoshka representation learning paradigm to train efficiently across multiple audio and visual granularities, reducing the paradigm's inherent training cost. Furthermore, we explore three LoRA-based strategies for adapting the backbone LLM, balancing shared adaptation against task-specific specialization. Experiments on LRS2 and LRS3 show that Omni-AVSR matches or surpasses the accuracy of state-of-the-art baselines while training a single model at substantially lower training and deployment cost. The model also remains robust under acoustic noise, and we analyze its scaling behavior as the LLM grows, offering insights into the trade-off between performance and efficiency.
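To make the two ingredients above concrete, the sketch below illustrates, in PyTorch-style Python, (i) matryoshka-style training over multiple audio and visual token granularities and (ii) per-task LoRA adapters around a frozen backbone layer. This is a minimal sketch under stated assumptions, not the paper's implementation: the names `compress_tokens`, `TaskLoRA`, `training_step`, and the rate lists are all illustrative, and the exact granularity-sampling and LoRA-sharing schemes are the paper's contributions and are not reproduced here.

```python
# Minimal sketch (not the authors' code) of matryoshka-style multi-granularity
# training with task-specific LoRA adapters. Assumes PyTorch; all names and
# compression rates below are illustrative assumptions.
import random
import torch
import torch.nn as nn

AUDIO_RATES = [1, 2, 4]   # assumed audio token-compression factors
VIDEO_RATES = [1, 2, 5]   # assumed video token-compression factors

def compress_tokens(x: torch.Tensor, rate: int) -> torch.Tensor:
    """Average-pool a (batch, time, dim) token sequence by `rate`."""
    b, t, d = x.shape
    t = (t // rate) * rate                      # drop the ragged tail
    return x[:, :t].reshape(b, t // rate, rate, d).mean(dim=2)

class TaskLoRA(nn.Module):
    """One low-rank adapter per task (ASR/VSR/AVSR) around a frozen linear layer."""
    def __init__(self, base: nn.Linear, tasks, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)  # frozen backbone weight
        self.adapters = nn.ModuleDict({
            t: nn.Sequential(nn.Linear(base.in_features, rank, bias=False),
                             nn.Linear(rank, base.out_features, bias=False))
            for t in tasks
        })
        for a in self.adapters.values():        # standard LoRA init: up-proj = 0
            nn.init.zeros_(a[1].weight)

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        return self.base(x) + self.adapters[task](x)

def training_step(model, audio_tokens, video_tokens, task: str):
    """One matryoshka-style step: draw a random granularity pair so a single
    model learns to decode from coarse and fine token streams alike."""
    a = compress_tokens(audio_tokens, random.choice(AUDIO_RATES))
    v = compress_tokens(video_tokens, random.choice(VIDEO_RATES))
    return model(a, v, task=task)   # model applies TaskLoRA layers internally
```

A fully shared variant would replace the `ModuleDict` with a single adapter used by all three tasks; the design space between that extreme and the per-task case above is what the three LoRA strategies in the abstract presumably span. At inference, fixing the rates instead of sampling them yields the elastic accuracy-efficiency trade-off described above.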