Research in auditory, visual, and audiovisual speech recognition (ASR, VSR, and AVSR, respectively) has traditionally been conducted independently. Even recent self-supervised studies addressing two or all three tasks simultaneously tend to yield separate models, leading to disjoint inference pipelines with increased memory requirements and redundancies. This paper proposes unified training strategies for these systems. We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance, overcoming typical optimisation challenges when training from scratch. Moreover, we introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples, addressing shortcomings in related self-supervised methods. Finally, we develop a self-supervised pre-training method within our framework, proving its effectiveness alongside our semi-supervised approach. Despite using a single model for all tasks, our unified approach achieves state-of-the-art performance compared to recent methods on LRS3 and LRS2 for ASR, VSR, and AVSR, as well as on the newly released WildVSR dataset. Code and models are available at https://github.com/ahaliassos/usr.