Recent work on video captioning has focused on designing architectures that consume both video and text modalities, and on pre-training with large-scale video datasets that come with text transcripts, such as HowTo100M. Although these approaches have achieved significant improvements, the audio modality is often ignored. In this work, we present an audio-visual framework that aims to fully exploit the potential of the audio modality for captioning. Instead of relying on text transcripts extracted via automatic speech recognition (ASR), we argue that learning from raw audio signals can be more beneficial, because audio carries additional information such as acoustic events and speaker identity. Our contributions are twofold. First, we observe that the model overspecializes to the audio modality when pre-trained on both video and audio, since the ground truth (i.e., the text transcripts) can be predicted from audio alone. We propose a Modality Balanced Pre-training (MBP) loss to mitigate this issue, which significantly improves performance on downstream tasks. Second, we systematically examine different design choices for the cross-modal module, which can become an information bottleneck and produce inferior results. We propose new local-global fusion mechanisms to improve information exchange between audio and video. We demonstrate significant improvements from leveraging the audio modality on four datasets, and even outperform the state of the art on some metrics without relying on the text modality as input.