Automatic lip-reading (ALR) aims to transcribe spoken content from a speaker's silent lip motion captured in video. Current mainstream lip-reading approaches use only a single visual encoder to model input video at a single scale. In this paper, we propose to enhance lip-reading by incorporating multi-scale video data and multiple encoders. Specifically, we first propose a novel multi-scale lip extraction algorithm based on the size of the speaker's face, together with an enhanced ResNet3D visual front-end (VFE) that extracts lip features at different scales. For the encoders, in addition to the mainstream Transformer and Conformer, we also incorporate the recently proposed Branchformer and E-Branchformer as visual encoders. In our experiments, we explore how different video data scales and encoders affect ALR system performance, and we fuse the texts transcribed by all ALR systems using recognizer output voting error reduction (ROVER). Our proposed approach placed second in the ICME 2024 ChatCLR Challenge Task 2, with a 21.52% reduction in character error rate (CER) compared to the official baseline on the evaluation set.
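To make the idea of face-size-based multi-scale lip extraction concrete, the following is a minimal sketch rather than the authors' algorithm, whose details the abstract does not give. It assumes that a face bounding box and a lip centre point are already available from some detector, and that each scale is a square crop whose side is a fixed fraction of the face width. The names `Box` and `multi_scale_lip_boxes` and the ratios `(0.4, 0.6, 0.8)` are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    left: int
    top: int
    right: int
    bottom: int


def multi_scale_lip_boxes(
    face: Box,
    lip_center: Tuple[int, int],
    ratios: Tuple[float, ...] = (0.4, 0.6, 0.8),  # assumed values, not from the paper
    frame_size: Optional[Tuple[int, int]] = None,  # (width, height), optional
) -> List[Box]:
    """Return one square crop box per ratio, centred on the lips.

    The side length of each crop is ratio * face width, so the crops
    grow or shrink with the speaker's apparent face size.
    """
    face_width = face.right - face.left
    cx, cy = lip_center
    boxes = []
    for ratio in ratios:
        half = round(ratio * face_width / 2)
        left, top, right, bottom = cx - half, cy - half, cx + half, cy + half
        if frame_size is not None:
            # Clamp the crop to the frame so it never leaves the image.
            width, height = frame_size
            left, right = max(0, left), min(width, right)
            top, bottom = max(0, top), min(height, bottom)
        boxes.append(Box(left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    face = Box(left=100, top=50, right=300, bottom=300)
    for box in multi_scale_lip_boxes(face, lip_center=(200, 240)):
        print(box)
```

Under these assumptions, each crop would then be resized to the input resolution expected by the visual front-end, so that a separate ALR system can be trained per scale before the transcripts are fused.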