Large-scale video-language pre-training models, which usually build a global alignment between video and text, have achieved remarkable progress on various downstream tasks, but the use of fine-grained information during pre-training remains underexplored. In this work, we propose STOA-VLP, a pre-training framework that jointly models object and action information across spatial and temporal dimensions. More specifically, the model treats object trajectories across frames and multiple action features from the video as fine-grained features. In addition, we design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model. The first is the dynamic object-text alignment task, which builds a better connection between object trajectories and the relevant noun tokens. The second is the spatial-temporal action set prediction task, which guides the model to generate consistent action features by predicting the actions found in the text. Extensive experiments on three downstream tasks (video captioning, text-video retrieval, and video question answering) demonstrate the effectiveness of the proposed STOA-VLP (e.g., a 3.7 ROUGE-L improvement on the MSR-VTT video captioning benchmark and a 2.9% accuracy improvement on the MSVD video question answering benchmark, compared to previous approaches).