Research in auditory, visual, and audiovisual speech recognition (ASR, VSR, and AVSR, respectively) has traditionally been conducted independently. Even recent self-supervised studies addressing two or all three tasks simultaneously tend to yield separate models, leading to disjoint inference pipelines with increased memory requirements and redundancies. This paper proposes unified training strategies for these systems. We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance, overcoming typical optimisation challenges when training from scratch. Moreover, we introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples, addressing shortcomings in related self-supervised methods. Finally, we develop a self-supervised pre-training method within our framework, proving its effectiveness alongside our semi-supervised approach. Despite using a single model for all tasks, our unified approach achieves state-of-the-art performance compared to recent methods on LRS3 and LRS2 for ASR, VSR, and AVSR, as well as on the newly released WildVSR dataset. Code and models are available at https://github.com/ahaliassos/usr.