Beyond Grounding: Extracting Fine-Grained Event Hierarchies Across Modalities

Events describe happenings in our world that are of importance. Understanding the events mentioned in multimedia content, and how they relate to one another, is therefore an important way of comprehending our world. Existing literature can infer whether events across the textual and visual (video) domains are identical (via grounding) and thus on the same semantic level. However, grounding fails to capture the intricate cross-event relations that arise when the same event is referred to at many semantic levels. For example, in Figure 1, the abstract event of "war" manifests at a lower semantic level through the subevents "tanks firing" (in video) and an airplane being "shot" (in text), leading to a hierarchical, multimodal relationship between the events. In this paper, we propose the task of extracting event hierarchies from multimodal (video and text) data to capture how the same event manifests itself in different modalities at different semantic levels. This reveals the structure of events and is critical to understanding them. To support research on this task, we introduce the Multimodal Hierarchical Events (MultiHiEve) dataset. Unlike prior video-language datasets, MultiHiEve is composed of news video-article pairs, which makes it rich in event hierarchies. We densely annotate a part of the dataset to construct the test benchmark. We show the limitations of state-of-the-art unimodal and multimodal baselines on this task. Further, we address these limitations via a new weakly supervised model that leverages only unannotated video-article pairs from MultiHiEve. A thorough evaluation of our proposed method demonstrates improved performance on this task and highlights opportunities for future research.
