Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models

Source: arXiv. Accepted to IEEE ICASSP 2026 (camera-ready version). Project website (code and model weights): https://umbertocappellazzo.github.io/Omni-AVSR/

Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities, including Auditory Speech Recognition (ASR), Visual Speech Recognition (VSR), and Audio-Visual Speech Recognition (AVSR). Despite this progress, current LLM-based approaches typically address each task independently, training separate models, which raises computational and deployment costs and misses potential cross-task synergies. They also rely on fixed-rate token compression, which restricts flexibility in balancing accuracy against efficiency. These limitations highlight the need for a unified framework that supports ASR, VSR, and AVSR while enabling elastic inference. To this end, we present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Specifically, we adapt the matryoshka representation learning paradigm to train efficiently across multiple audio and visual granularities, reducing the training cost inherent to that paradigm. Furthermore, we explore three LoRA-based strategies for adapting the backbone LLM, balancing shared and task-specific specialization. Experiments on LRS2 and LRS3 show that Omni-AVSR achieves accuracy comparable or superior to state-of-the-art baselines while training a single model at substantially lower training and deployment cost. The model also remains robust under acoustic noise, and we analyze its scaling behavior as LLM size increases, providing insights into the trade-off between performance and efficiency.
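
To make the contrast between fixed-rate compression and elastic inference concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the average-pooling compressor, the rate sets AUDIO_RATES and VIDEO_RATES, the audio-then-video ordering, and the function names are all assumptions chosen for illustration. The idea it shows is that if one rate per modality is sampled at each training step, a single model sees several token granularities, so any of those rates can be picked at inference time.

```python
import random


def compress(tokens, rate):
    """Average-pool consecutive groups of `rate` token vectors into one vector.

    Pooling is an assumed compression scheme; a trailing group shorter than
    `rate` is averaged over the tokens it actually contains.
    """
    if rate < 1:
        raise ValueError("rate must be >= 1")
    pooled = []
    for start in range(0, len(tokens), rate):
        group = tokens[start:start + rate]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled


# Hypothetical rate sets, for illustration only.
AUDIO_RATES = (4, 16)
VIDEO_RATES = (2, 5)


def sample_training_rates(rng):
    """Pick one audio rate and one video rate for the current training step."""
    return rng.choice(AUDIO_RATES), rng.choice(VIDEO_RATES)


def build_input(audio_tokens, video_tokens, task, audio_rate, video_rate):
    """Return the compressed token sequence for one task (ASR, VSR, or AVSR).

    Placing audio tokens before video tokens is an assumption of this sketch.
    """
    if task not in ("ASR", "VSR", "AVSR"):
        raise ValueError(f"unknown task: {task}")
    parts = []
    if task in ("ASR", "AVSR"):
        parts += compress(audio_tokens, audio_rate)
    if task in ("VSR", "AVSR"):
        parts += compress(video_tokens, video_rate)
    return parts


if __name__ == "__main__":
    rng = random.Random(0)
    audio = [[float(i), 0.0] for i in range(32)]  # 32 toy audio tokens
    video = [[0.0, float(i)] for i in range(10)]  # 10 toy video tokens
    a_rate, v_rate = sample_training_rates(rng)
    for task in ("ASR", "VSR", "AVSR"):
        seq = build_input(audio, video, task, a_rate, v_rate)
        print(task, a_rate, v_rate, len(seq))
```

In this sketch, a higher rate yields a shorter sequence for the LLM and therefore cheaper inference, presumably at some cost in accuracy; a fixed-rate system would hard-code a single value per modality instead of sampling from a set.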
