Audio is the primary modality of human communication and has driven the success of Automatic Speech Recognition (ASR) technologies. However, audio-centric systems inherently exclude individuals who are deaf or hard of hearing. Visual alternatives such as sign language and lip reading offer effective substitutes, and recent advances in Sign Language Translation (SLT) and Visual Speech Recognition (VSR) have made communication without audio increasingly practical. Yet these modalities have largely been studied in isolation, and their integration within a unified framework remains underexplored. In this paper, we propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We pursue three main objectives: (i) designing a unified, modality-agnostic architecture that effectively processes heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) matching or surpassing state-of-the-art models specialized for individual tasks. Built on this framework, our model performs on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and Audio-Visual Speech Recognition (AVSR). Furthermore, our analysis yields a key linguistic insight: explicitly modeling lip movements as a distinct modality significantly improves SLT performance by capturing critical non-manual cues.