In this paper, we present ENTER, an interpretable Video Question Answering (VideoQA) system based on event graphs. Event graphs convert videos into graphical representations, where video events form the nodes and event-event relationships (temporal/causal/hierarchical) form the edges. This structured representation offers several benefits: 1) interpretable VideoQA via generated code that parses the event graph; 2) incorporation of contextual visual information into the reasoning process (code generation) via event graphs; 3) robust VideoQA via hierarchical iterative updates of the event graphs. Existing interpretable VideoQA systems are often top-down, disregarding low-level visual information when generating the reasoning plan, and are consequently brittle; bottom-up approaches, which produce responses directly from visual data, lack interpretability. Experimental results on NExT-QA, IntentQA, and EgoSchema demonstrate that our method not only outperforms existing top-down approaches and achieves competitive performance against bottom-up approaches, but, more importantly, offers superior interpretability and explainability in the reasoning process.
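To make the event-graph representation concrete, the following is a minimal, illustrative sketch of the idea: events as nodes, labeled relations as edges, and the kind of graph-parsing step that generated code might perform. All names here (the `EventGraph` class, the relation labels, the example query) are our assumptions for exposition, not the system's actual data structures.

```python
# Illustrative sketch of an event graph (assumed structure, not the
# paper's actual implementation): events are nodes, and labeled edges
# encode temporal/causal/hierarchical relations between events.

from dataclasses import dataclass, field

@dataclass
class EventGraph:
    # node id -> short natural-language event description
    events: dict = field(default_factory=dict)
    # (src, dst) -> relation label: "temporal", "causal", or "hierarchical"
    edges: dict = field(default_factory=dict)

    def add_event(self, eid, description):
        self.events[eid] = description

    def relate(self, src, dst, relation):
        self.edges[(src, dst)] = relation

    def neighbors(self, eid, relation):
        """Events reachable from `eid` via edges with the given relation label."""
        return [dst for (src, dst), rel in self.edges.items()
                if src == eid and rel == relation]

# Build a toy graph for a two-event clip.
g = EventGraph()
g.add_event("e1", "a man drops a glass")
g.add_event("e2", "the glass shatters")
g.relate("e1", "e2", "causal")

# The kind of graph-parsing step that generated code might perform
# to answer a causal question such as "Why did the glass shatter?":
# find events with a causal edge into e2.
causes = [e for e in g.events if "e2" in g.neighbors(e, "causal")]
print(g.events[causes[0]])  # -> a man drops a glass
```

In this sketch, answering a question reduces to a traversal over labeled edges, which is what makes the reasoning inspectable: the generated code, not an opaque model forward pass, determines the answer.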