VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Although speech is a simple and effective way for humans to communicate with the outside world, realistic speech interaction carries multimodal information, e.g., vision and text. How to design a unified framework that integrates information from different modalities and leverages different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning has not been well explored. In this paper, we propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model). VATLM employs a unified backbone network to model modality-independent information and uses three simple modality-dependent modules to preprocess visual, speech, and text inputs. To integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task over unified tokens, which are produced by our proposed unified tokenizer. We evaluate the pre-trained VATLM on audio-visual downstream tasks, namely audio-visual speech recognition (AVSR) and visual speech recognition (VSR). Results show that VATLM outperforms previous state-of-the-art models, such as the audio-visual pre-trained AV-HuBERT model, and analysis also demonstrates that VATLM is capable of aligning different modalities into the same space. To facilitate future research, we release the code and pre-trained models at https://aka.ms/vatlm.
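
The masked prediction objective mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the backbone emits one score vector per frame over the unified-token vocabulary, that the tokenizer supplies one target id per frame, and that the loss is a cross-entropy averaged over masked frames only. The function names and the `mask_prob` and `span_len` parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def mask_spans(num_frames, mask_prob, span_len, rng):
    """Choose frames to mask: each frame may start a span of span_len frames.

    mask_prob and span_len are illustrative assumptions, not values from the paper.
    """
    masked = set()
    for start in range(num_frames):
        if rng.random() < mask_prob:
            masked.update(range(start, min(start + span_len, num_frames)))
    return sorted(masked)


def masked_prediction_loss(logits, targets, masked):
    """Mean cross-entropy over masked frames only.

    logits:  one list of vocabulary scores per frame (assumed backbone output)
    targets: one unified-token id per frame (assumed tokenizer output)
    masked:  indices of frames hidden from the backbone input
    """
    if not masked:
        return 0.0
    total = 0.0
    for t in masked:
        row = logits[t]
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[targets[t]]  # -log softmax(row)[target]
    return total / len(masked)


# Toy usage with random scores standing in for a real backbone.
rng = random.Random(0)
vocab_size, num_frames = 8, 20
targets = [rng.randrange(vocab_size) for _ in range(num_frames)]
logits = [[rng.gauss(0.0, 1.0) for _ in range(vocab_size)] for _ in range(num_frames)]
masked = mask_spans(num_frames, mask_prob=0.1, span_len=3, rng=rng)
loss = masked_prediction_loss(logits, targets, masked)
```

Because the targets are token ids in a single shared vocabulary, the same loss function would apply whether the masked input frames came from video, audio, or text, which is the sense in which the prediction task is described as unified.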

