Automatic lip-reading (ALR) aims to transcribe spoken content from a speaker's silent lip motion captured in video. Current mainstream lip-reading approaches use only a single visual encoder to model input video at a single scale. In this paper, we propose to enhance lip-reading by incorporating multi-scale video data and multiple encoders. Specifically, we first propose a novel multi-scale lip motion extraction algorithm based on the size of the speaker's face, together with an Enhanced ResNet3D visual front-end (VFE), to extract lip features at different scales. For the encoders, in addition to the mainstream Transformer and Conformer, we also incorporate the recently proposed Branchformer and E-Branchformer as visual encoders. In the experiments, we explore the influence of different video data scales and encoders on ALR system performance, and we fuse the texts transcribed by all ALR systems using recognizer output voting error reduction (ROVER). Our proposed approach placed second in the ICME 2024 ChatCLR Challenge Task 2, with a 21.52% reduction in character error rate (CER) compared to the official baseline on the evaluation set.
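The reported metric, CER, is commonly computed as the character-level edit distance between the reference and the hypothesis, divided by the reference length. The following is a minimal sketch under that assumption; the function names `edit_distance` and `cer` are illustrative, and the challenge's official scoring script may normalize text or aggregate over utterances differently.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `cer("今天天气很好", "今天天气不好")` counts one substitution over six reference characters. Because insertions are counted, CER can exceed 1.0 when the hypothesis is much longer than the reference.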