We present RAVEn, a self-supervised multi-modal approach to jointly learn visual and auditory speech representations. Our pre-training objective involves encoding masked inputs, and then predicting contextualised targets generated by slowly-evolving momentum encoders. Driven by the inherent differences between video and audio, our design is asymmetric with respect to the two modalities' pretext tasks: whereas the auditory stream predicts both the visual and auditory targets, the visual stream predicts only the auditory targets. We observe strong results in both low- and high-resource labelled data settings when fine-tuning the visual and auditory encoders, which are trained jointly in a single pre-training stage. Notably, RAVEn surpasses all self-supervised methods on visual speech recognition (VSR) on LRS3, and combining RAVEn with self-training using only 30 hours of labelled data even outperforms a recent semi-supervised method trained on 90,000 hours of non-public data. At the same time, we achieve state-of-the-art results in the LRS3 low-resource setting for auditory speech recognition (as well as for VSR). Our findings point to the viability of learning powerful speech representations entirely from raw video and audio, i.e., without relying on handcrafted features. Code and models are available at https://github.com/ahaliassos/raven.
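As a rough illustration of the two mechanisms named above, the following minimal sketch shows a slowly-evolving momentum (exponential moving average) update and the asymmetric assignment of prediction targets. It is not taken from the released code: the function names, the momentum value, the flat parameter lists, and the placeholder per-pair losses are all assumptions made for illustration, and the actual loss function is not specified here.

```python
# Illustrative sketch only: every name and value below is hypothetical
# and is not taken from the RAVEn repository.

def ema_update(teacher, student, momentum=0.999):
    """Move each teacher (momentum encoder) parameter slightly toward the student.

    `teacher` and `student` are flat lists of floats standing in for
    network weights; the momentum value is an assumed placeholder.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Asymmetric pretext tasks as described in the abstract:
# each student stream maps to the teacher targets it predicts.
PREDICTION_TARGETS = {
    "audio": ("video", "audio"),  # auditory stream predicts both targets
    "video": ("audio",),          # visual stream predicts auditory targets only
}


def total_loss(pair_losses):
    """Sum the losses of the active (student, target) pairs.

    `pair_losses` maps (student, target) to an already-computed scalar;
    how each scalar is computed is left open in this sketch.
    """
    return sum(
        pair_losses[(student, target)]
        for student, targets in PREDICTION_TARGETS.items()
        for target in targets
    )
```

In this sketch a (video, video) entry in `pair_losses` is simply ignored, which mirrors the statement that the visual stream predicts only the auditory targets.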