Lipreading refers to understanding the speech of a speaker in a video and translating it into natural language. State-of-the-art lipreading methods excel at interpreting overlapped speakers, i.e., speakers who appear in both the training and inference sets. However, generalizing these methods to unseen speakers incurs catastrophic performance degradation, owing to the limited number of speakers in the training set and the evident visual variations in lip shape and color across speakers. Therefore, depending merely on the visible changes of the lips tends to cause model overfitting. To address this problem, we propose to use multi-modal features spanning visual appearance and facial landmarks, which can describe lip motion irrespective of speaker identity. We then develop a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer. Specifically, LipFormer consists of a lip motion stream, a facial landmark stream, and a cross-modal fusion module. The embeddings of the two streams are produced by self-attention and then fed to a cross-attention module to align the visual and landmark features. Finally, the resulting fused features are decoded into output text by a cascaded seq2seq model. Experiments demonstrate that our method effectively enhances model generalization to unseen speakers.
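
As a rough illustration of the two-stream design described above, the following minimal sketch applies self-attention to each stream and then cross-attention between them. It is not the LipFormer implementation: the function names, the choice of visual features as queries over landmark keys and values, the final residual-style sum, and the absence of learned projections, multiple heads, and the seq2seq decoder are all assumptions made to keep the example short.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain lists of vectors.

    queries: list of d-dim vectors; keys/values: lists of equal length.
    Returns one output vector per query.
    """
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        )
    return outputs


def self_attention(sequence):
    """Self-attention: the sequence attends to itself (no learned weights)."""
    return attention(sequence, sequence, sequence)


def cross_modal_fusion(visual_frames, landmark_frames):
    """Hypothetical fusion step: visual queries attend to landmark features.

    Assumption: the attended landmark features are added to the visual
    embeddings; the source does not specify how the fused features are formed.
    """
    visual_emb = self_attention(visual_frames)
    landmark_emb = self_attention(landmark_frames)
    aligned = attention(visual_emb, landmark_emb, landmark_emb)
    return [[v + a for v, a in zip(vis, ali)] for vis, ali in zip(visual_emb, aligned)]


# Toy inputs: 3 visual frames and 4 landmark frames, each a 2-dim feature.
visual = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]]
landmarks = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
fused = cross_modal_fusion(visual, landmarks)
```

The fused sequence has one vector per visual frame, so the two streams may run at different lengths. In a full model these fused features would be passed to a decoder to produce text.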