Video question answering (VideoQA) is the task of predicting the correct answer to a question posed about a given video. The system must comprehend spatial and temporal relationships among objects extracted from the video to perform causal and temporal reasoning. While prior work has focused on modeling individual object movements with transformer-based methods, these approaches falter when capturing complex scenarios involving multiple objects (e.g., "a boy is throwing a ball in a hoop"). To address this limitation, we propose a contrastive language event graph representation learning method called CLanG. To capture event representations associated with multiple objects, our method employs a multi-layer GNN-cluster module for adversarial graph representation learning, enabling contrastive learning between the question text and its relevant multi-object event graph. Our method outperforms a strong baseline, achieving up to 2.2% higher accuracy on two challenging VideoQA datasets, NExT-QA and TGIF-QA-R. In particular, it is 2.8% better than baselines at handling causal and temporal questions, highlighting its strength in reasoning over multi-object events.
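The contrastive objective between question-text embeddings and event-graph embeddings can be illustrated with a standard symmetric InfoNCE loss. This is a minimal sketch under stated assumptions: the function name, the temperature value, and the use of in-batch negatives are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def info_nce(text_emb, graph_emb, temperature=0.07):
    """Symmetric InfoNCE loss between paired embeddings.

    text_emb, graph_emb: (batch, dim) arrays; row i of each forms a
    positive pair, and all other rows in the batch act as negatives.
    (Illustrative sketch, not the CLanG implementation.)
    """
    # L2-normalize so dot products become cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    logits = t @ g.T / temperature  # (batch, batch) similarity matrix

    def xent_diag(l):
        # cross-entropy with the diagonal (matched pairs) as targets
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average the text-to-graph and graph-to-text directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

Minimizing this loss pulls each question embedding toward its matching event-graph embedding while pushing it away from the other graphs in the batch, which is the general mechanism behind text-graph contrastive learning.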