We propose to perform video question answering (VideoQA) in a Contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module that encodes video by explicitly capturing the visual objects, their relations, and their dynamics, for complex spatio-temporal reasoning. 2) It designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of a multi-modal transformer for answer classification. Fine-grained video-text communication is handled by additional cross-modal interaction modules. 3) It is optimized by joint fully- and self-supervised contrastive objectives, between the correct and incorrect answers and between the relevant and irrelevant questions, respectively. With superior video encoding and a superior QA solution, we show that CoVGT achieves much better performance than prior art on video reasoning tasks. Its performance even surpasses that of models pretrained on millions of external data samples. We further show that CoVGT can also benefit from cross-modal pretraining, yet with orders of magnitude less data. The results demonstrate the effectiveness and superiority of CoVGT, and additionally reveal its potential for more data-efficient pretraining. We hope our success can advance VideoQA beyond coarse recognition/description towards fine-grained relation reasoning over video contents. Our code will be available at https://github.com/doc-doc/CoVGT.
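The contrastive formulation of QA described above can be illustrated with a minimal sketch. It assumes that the video side (after interacting with the question) yields a single query vector and that the text transformer yields one vector per candidate answer. The function names, the dot-product similarity, and the softmax cross-entropy loss are illustrative assumptions, not taken from the CoVGT code, and the toy embeddings are hypothetical values.

```python
import math


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def answer_question(query_vec, answer_vecs):
    """Score every candidate answer against the query and pick the best.

    Returns (index of the highest-scoring answer, list of scores).
    """
    scores = [dot(query_vec, a) for a in answer_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


def contrastive_loss(scores, positive_index):
    """Cross-entropy over similarity scores (assumed loss form).

    Minimizing it raises the positive's score relative to the negatives.
    """
    probs = softmax(scores)
    return -math.log(probs[positive_index])


# Hypothetical toy embeddings: answer 1 is closest to the query.
query = [0.9, 0.1, 0.0]
answers = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

pred, scores = answer_question(query, answers)
loss_if_answer_1_correct = contrastive_loss(scores, 1)
loss_if_answer_0_correct = contrastive_loss(scores, 0)
```

Under this sketch, inference reduces to picking the candidate with the highest similarity, so no fixed answer vocabulary or classification head is needed. The same loss form could in principle be applied with questions as the contrasted items, which is how the self-supervised objective between relevant and irrelevant questions might be read.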