Video-language pre-trained models have shown remarkable success in video question answering (VideoQA). However, because video sequences are long, training large-scale video-based models costs considerably more than training image-based ones. This motivates us to leverage the knowledge from image-based pre-training, despite the obvious gaps between the image and video domains. To bridge these gaps, we propose Tem-Adapter, which learns temporal dynamics and complex semantics through a visual Temporal Aligner and a textual Semantic Aligner. Unlike conventional methods for adapting pre-trained knowledge, which concentrate only on the downstream task objective, the Temporal Aligner introduces an extra language-guided autoregressive task to facilitate the learning of temporal dependencies: it predicts future states from historical clues and from language guidance that describes event progression. In addition, to reduce the semantic gap and adapt the textual representation for better event description, we introduce a Semantic Aligner that first uses a template to fuse question and answer pairs into event descriptions and then learns a Transformer decoder that refines them with the whole video sequence as guidance. We evaluate Tem-Adapter and several methods for transferring pre-trained models on two VideoQA benchmarks, and the significant performance improvement demonstrates the effectiveness of our method.
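As a rough illustration of the two data-preparation ideas described above, the following minimal sketch shows how question and answer pairs might be fused into event descriptions and how a frame sequence might be split into (history, target) pairs for next-state prediction. The template string and the names `fuse_question_answer` and `autoregressive_pairs` are assumptions made for illustration; the abstract does not specify the template or the prediction setup, and the learned Transformer components are not modelled here.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

# Hypothetical template: the abstract only says that "a template" is used,
# so this wording is an assumption, not the authors' actual template.
TEMPLATE = "Question: {question} Answer: {answer}"


def fuse_question_answer(
    question: str, candidates: Sequence[str], template: str = TEMPLATE
) -> List[str]:
    """Fuse one question with each candidate answer into an event description."""
    # Collapse repeated whitespace so the descriptions are uniform.
    q = " ".join(question.split())
    return [
        template.format(question=q, answer=" ".join(a.split()))
        for a in candidates
    ]


def autoregressive_pairs(frames: Sequence[T]) -> List[Tuple[List[T], T]]:
    """Build (history, target) pairs for next-state prediction.

    For a sequence f_0 .. f_{n-1}, the pair at step t uses f_0 .. f_{t-1}
    as history and f_t as the state to predict.
    """
    return [(list(frames[:t]), frames[t]) for t in range(1, len(frames))]


if __name__ == "__main__":
    descriptions = fuse_question_answer(
        "What happens next?", ["the car turns left", "the car stops"]
    )
    for d in descriptions:
        print(d)
    for history, target in autoregressive_pairs(["f0", "f1", "f2"]):
        print(history, "->", target)
```

In a full model, each description would be encoded by the text encoder and each history would be passed, together with the language guidance, to the predictor of the next state; those parts are beyond what the abstract specifies.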