Understanding event relationships in videos requires a model to grasp both the underlying structure of events (i.e., the event type, the associated argument roles, and the corresponding entities) and the factual knowledge needed for reasoning. Methods based on structural symbolic representation (SSR) take event types and their associated argument roles/entities directly as inputs to perform reasoning. However, the state-of-the-art video event-relation prediction system shows the necessity of using continuous feature vectors from input videos; existing methods based solely on SSR inputs fail completely, even when given oracle event types and argument roles. In this paper, we conduct an extensive empirical analysis to answer the following questions: 1) why SSR-based methods failed; 2) how to properly understand the evaluation setting of video event-relation prediction; 3) how to uncover the potential of SSR-based methods. We first identify suboptimal training settings as the cause of the failure of previous SSR-based video event prediction models. Then, through qualitative and quantitative analysis, we show that evaluation taking only video as input is currently infeasible, and that accurate evaluation relies on oracle event information. Based on these findings, we propose to further contextualize the SSR-based model into an Event-Sequence Model and to equip it with more factual knowledge through a simple yet effective reformulation of external visual commonsense knowledge bases into an event-relation prediction pretraining dataset. The resulting model sets a new state of the art, with a 25% boost in Macro-accuracy.
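As a rough illustration of two terms used above, the sketch below shows one way an SSR input (event type plus argument roles and entities) could be flattened into text, and how Macro-accuracy differs from plain accuracy. It is a minimal sketch under stated assumptions, not the paper's implementation: the `Event` class, the `serialize` format, and the relation labels are all hypothetical, and Macro-accuracy is taken here to mean the unweighted mean of per-class accuracies.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    """Hypothetical SSR container: an event type plus role -> entity arguments."""
    verb: str
    args: Dict[str, str] = field(default_factory=dict)


def serialize(event: Event) -> str:
    """Flatten an event into a token string (assumed format, for illustration only)."""
    parts = [f"[VERB] {event.verb}"]
    for role, entity in event.args.items():
        parts.append(f"[{role}] {entity}")
    return " ".join(parts)


def macro_accuracy(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-class accuracies, so rare relations count as much as common ones."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for g, p in zip(gold, pred):
        total[g] += 1
        if g == p:
            correct[g] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical labels: a model that always predicts the majority relation.
gold = ["causes", "causes", "causes", "enables"]
pred = ["causes", "causes", "causes", "causes"]
plain = sum(g == p for g, p in zip(gold, pred)) / len(gold)
macro = macro_accuracy(gold, pred)
text = serialize(Event("open", {"Arg0": "man", "Arg1": "door"}))
```

In this toy case plain accuracy is 0.75 while Macro-accuracy is 0.5, which is why a macro-averaged metric penalizes a model that collapses onto the most frequent relation.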