CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing

There has been a long-standing quest for a unified audio-visual-text model that supports a wide range of multimodal understanding tasks by mimicking how humans listen, see and read. Humans tend to represent knowledge using two separate systems: one for verbal (textual) information and one for non-verbal (visual and auditory) information. These two systems can operate independently but can also interact with each other. Motivated by this understanding of human cognition, in this paper we introduce CoAVT, a novel cognition-inspired Correlated Audio-Visual-Text pre-training model that connects the three modalities. It contains a joint audio-visual encoder, which learns to encode audio-visual synchronization information together with the audio and visual content as non-verbal information, and a text encoder, which handles textual input as verbal information. To bridge the gap between modalities, CoAVT employs a query encoder, which contains a set of learnable query embeddings and extracts the audiovisual features that are most informative for the corresponding text. Additionally, to leverage the correspondences of audio and of vision with language, we establish audio-text and visual-text bi-modal alignments on top of the foundational audiovisual-text tri-modal alignment to enhance multimodal representation learning. Finally, we jointly optimize the CoAVT model with three multimodal objectives: a contrastive loss, a matching loss and a language modeling loss. Extensive experiments show that CoAVT learns strong multimodal correlations and generalizes to various downstream tasks. CoAVT establishes new state-of-the-art performance on the text-video retrieval task on AudioCaps in both zero-shot and fine-tuning settings, and on audio-visual event classification and audio-visual retrieval tasks on AudioSet and VGGSound.
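To make the contrastive objective concrete, the following is a minimal sketch of one common form such a loss could take: a symmetric InfoNCE loss over a batch of paired audiovisual and text embeddings. The abstract does not specify the exact formulation, so the function names, the temperature value of 0.07, the use of cosine similarity, and the equal weighting of the three objectives in `total_loss` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(av_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss (an assumed form, not taken from the paper).

    av_feats[i] and text_feats[i] are treated as a matching pair; every other
    combination in the batch serves as a negative. The temperature is a
    hypothetical default.
    """
    av = [l2_normalize(v) for v in av_feats]
    tx = [l2_normalize(v) for v in text_feats]
    n = len(av)
    # Temperature-scaled similarity matrix: rows index audiovisual items,
    # columns index text items.
    sim = [
        [sum(a * b for a, b in zip(av[i], tx[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Audiovisual-to-text direction: each row should pick out its own column.
    av_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-audiovisual direction: same computation on the transpose.
    sim_t = [list(col) for col in zip(*sim)]
    text_to_av = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (av_to_text + text_to_av)


def total_loss(contrastive, matching, language_modeling):
    """Combine the three objectives; equal weighting is an assumption."""
    return contrastive + matching + language_modeling


if __name__ == "__main__":
    aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(f"aligned pairs: {aligned:.6f}, swapped pairs: {swapped:.6f}")
```

In this sketch, correctly paired embeddings yield a loss close to zero, while swapping the text embeddings drives the loss up, which is the behavior a contrastive alignment objective is meant to reward.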
