Visual speech (i.e., lip motion) is closely correlated with auditory speech because the two co-occur and are synchronized during speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is to model one modality with the aid of knowledge from the other. Specifically, two cross-modal boosters, built on an audio-visual pseudo-siamese structure, are introduced to learn the modality-transformed correlation. Inside each booster, a max-feature-map embedded Transformer variant is proposed for modality alignment and enhanced feature generation. The network is co-learned both from scratch and with pretrained models. Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that the proposed method achieves average relative performance improvements of 60% over independently trained audio-only/visual-only systems and 20% over baseline fusion systems.
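The abstract does not say how the max-feature-map operation is embedded in the Transformer variant. As a rough illustration only, the sketch below assumes the commonly used max-feature-map formulation, in which a feature vector is split into two halves and the element-wise maximum is kept. The function names `max_feature_map` and `max_feature_map_sequence` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def max_feature_map(features: Sequence[float]) -> List[float]:
    """Split a feature vector into two halves and keep the element-wise maximum.

    Assumption: this is the generic max-feature-map operation, not
    necessarily the exact variant used inside the paper's boosters.
    """
    if len(features) % 2 != 0:
        raise ValueError("feature dimension must be even")
    half = len(features) // 2
    first, second = features[:half], features[half:]
    # Each output unit is the larger of the two competing inputs.
    return [max(a, b) for a, b in zip(first, second)]


def max_feature_map_sequence(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Apply the operation independently to every frame of a sequence."""
    return [max_feature_map(frame) for frame in frames]


if __name__ == "__main__":
    # Two hypothetical 4-dimensional frames; the output has 2 dimensions per frame.
    frames = [[1.0, -2.0, 0.5, 3.0], [0.0, 4.0, 2.0, 1.0]]
    print(max_feature_map_sequence(frames))
```

Under this assumption the operation halves the feature dimension and acts as a competitive selection between paired units, which is one plausible reason to place it inside an alignment module.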