We propose to perform video question answering (VideoQA) in a Contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module that encodes video by explicitly capturing the visual objects, their relations, and their dynamics, for complex spatio-temporal reasoning. 2) It designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of a multi-modal transformer for answer classification. Fine-grained video-text communication is handled by additional cross-modal interaction modules. 3) It is optimized by joint fully- and self-supervised contrastive objectives, between the correct and incorrect answers and between the relevant and irrelevant questions, respectively. With superior video encoding and a superior QA solution, we show that CoVGT achieves much better performance than prior art on video reasoning tasks. Its performance even surpasses that of models pretrained on millions of external data samples. We further show that CoVGT can also benefit from cross-modal pretraining, with orders of magnitude less data. The results demonstrate the effectiveness and superiority of CoVGT, and additionally reveal its potential for more data-efficient pretraining. We hope our success can advance VideoQA beyond coarse recognition/description towards fine-grained relation reasoning over video content. Our code is available at https://github.com/doc-doc/CoVGT.
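To make the contrastive QA formulation above more concrete, the following is a minimal sketch, not the released CoVGT code. It assumes the video and each candidate answer have already been encoded into fixed-length vectors by separate encoders, scores each candidate by a dot product with the video vector, and applies a softmax cross-entropy loss that contrasts the correct answer against the incorrect ones. The function names, the toy vectors, and the choice of dot-product similarity are all illustrative assumptions. The analogous objective over relevant and irrelevant questions is not covered here.

```python
import math


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_qa(video_vec, candidate_vecs, correct_idx):
    """Hypothetical helper: rank candidate answers against a video vector.

    Returns the index of the best-scoring candidate and the softmax
    cross-entropy loss that pulls the correct answer towards the video
    and pushes the incorrect answers away.
    """
    scores = [dot(video_vec, c) for c in candidate_vecs]
    # Numerically stable log-sum-exp over all candidate scores.
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    loss = log_z - scores[correct_idx]
    pred = max(range(len(scores)), key=scores.__getitem__)
    return pred, loss


# Toy embeddings (assumed values, for illustration only).
video = [0.9, 0.1, 0.0]
answers = [
    [0.0, 1.0, 0.0],  # incorrect
    [1.0, 0.0, 0.0],  # correct
    [0.0, 0.0, 1.0],  # incorrect
]
pred, loss = contrastive_qa(video, answers, correct_idx=1)
```

Because answers are compared by similarity rather than classified over a fixed vocabulary, the same scoring routine applies unchanged to any number of candidates.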