Although speech is a simple and effective way for humans to communicate with the outside world, realistic speech interaction also carries multimodal information, e.g., vision and text. How to design a unified framework that integrates information from different modalities and leverages different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning has not been well explored. In this paper, we propose VATLM (Visual-Audio-Text Language Model), a unified cross-modal representation learning framework. VATLM employs a unified backbone network to model modality-independent information and uses three simple modality-dependent modules to preprocess visual, speech, and text inputs. To integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task over unified tokens produced by our proposed unified tokenizer. We evaluate the pre-trained VATLM on audio-visual downstream tasks, including audio-visual speech recognition (AVSR) and visual speech recognition (VSR). Results show that VATLM outperforms previous state-of-the-art models, such as the audio-visual pre-trained AV-HuBERT model, and analysis demonstrates that VATLM is capable of aligning different modalities into the same space. To facilitate future research, we release the code and pre-trained models at https://aka.ms/vatlm.
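As a rough illustration of the masked prediction objective over unified tokens, the sketch below computes a cross-entropy loss on masked frames only. It is not the released implementation: the span-based masking scheme, the restriction of the loss to masked positions, and the names `sample_span_mask` and `masked_prediction_loss` are all assumptions made for illustration, and the logits stand in for the output of the shared backbone.

```python
import math
import random


def sample_span_mask(num_frames, mask_prob, span_len, rng):
    """Return a list of bools marking masked frames.

    Assumption: each frame starts a masked span of `span_len` frames with
    probability `mask_prob`; the abstract does not specify the scheme.
    """
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < mask_prob:
            for t in range(start, min(start + span_len, num_frames)):
                mask[t] = True
    return mask


def masked_prediction_loss(logits, targets, mask):
    """Mean cross-entropy over masked positions only (an assumption).

    logits:  per-frame score lists over the unified-token vocabulary
    targets: unified-token ids, one per frame (from a unified tokenizer)
    mask:    bools, True where the input frame was masked
    """
    total, count = 0.0, 0
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(frame_logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in frame_logits))
        total += log_z - frame_logits[target]
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    mask = sample_span_mask(num_frames=10, mask_prob=0.2, span_len=3, rng=rng)
    # Uniform scores over a hypothetical 4-token vocabulary.
    logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(10)]
    targets = [t % 4 for t in range(10)]
    print(masked_prediction_loss(logits, targets, mask))
```

Because the targets are the same unified tokens regardless of whether a frame came from video, audio, or text, one loss of this form can in principle be shared across all three input types, which is the intuition behind the shared semantic space.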