Video question answering (VideoQA) is the task of predicting the correct answer to a question posed about a given video. The system must comprehend spatial and temporal relationships among objects extracted from the video to perform causal and temporal reasoning. While prior work has focused on modeling individual object movements with transformer-based methods, these approaches falter when capturing complex scenarios involving multiple objects (e.g., "a boy is throwing a ball in a hoop"). To address this limitation, we propose a contrastive language event graph representation learning method called CLanG. To capture event representations associated with multiple objects, our method employs a multi-layer GNN-cluster module for adversarial graph representation learning, enabling contrastive learning between the question text and its relevant multi-object event graph. Our method outperforms a strong baseline, achieving up to 2.2% higher accuracy on two challenging VideoQA datasets, NExT-QA and TGIF-QA-R. In particular, it is 2.8% better than baselines at handling causal and temporal questions, highlighting its strength in reasoning over multi-object events.
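The contrastive objective between question-text embeddings and event-graph embeddings can be illustrated with a standard symmetric InfoNCE loss. This is a minimal sketch under stated assumptions: the function name, the temperature value, and the use of in-batch negatives are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def info_nce(text_emb, graph_emb, temperature=0.07):
    """Symmetric InfoNCE loss between paired embeddings.

    text_emb, graph_emb: (batch, dim) arrays; row i of each forms a
    positive pair, and all other rows in the batch act as negatives.
    (Illustrative sketch, not the CLanG implementation.)
    """
    # L2-normalize so dot products become cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    logits = t @ g.T / temperature  # (batch, batch) similarity matrix

    def xent_diag(l):
        # cross-entropy with the diagonal (matched pairs) as targets
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average the text-to-graph and graph-to-text directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

Minimizing this loss pulls each question embedding toward its matching event-graph embedding while pushing it away from the other graphs in the batch, which is the general mechanism behind text-graph contrastive learning.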