Beyond Grounding: Extracting Fine-Grained Event Hierarchies Across Modalities

Events describe important happenings in our world. Understanding the events mentioned in multimedia content, and how they relate to one another, is therefore an important way of comprehending our world. Existing work can infer whether events across the textual and visual (video) domains are identical (via grounding) and thus on the same semantic level. However, grounding fails to capture the intricate cross-event relations that arise because the same event is referred to at many semantic levels. For example, in Figure 1, the abstract event of "war" manifests at a lower semantic level through the subevents "tanks firing" (in video) and an airplane being "shot" (in text), leading to a hierarchical, multimodal relationship between the events. In this paper, we propose the task of extracting event hierarchies from multimodal (video and text) data to capture how the same event manifests itself in different modalities at different semantic levels. This reveals the structure of events and is critical to understanding them. To support research on this task, we introduce the Multimodal Hierarchical Events (MultiHiEve) dataset. Unlike prior video-language datasets, MultiHiEve is composed of news video-article pairs, which makes it rich in event hierarchies. We densely annotate a part of the dataset to construct the test benchmark. We show the limitations of state-of-the-art unimodal and multimodal baselines on this task. Further, we address these limitations via a new weakly supervised model that leverages only unannotated video-article pairs from MultiHiEve. A thorough evaluation of the proposed method demonstrates improved performance on this task and highlights opportunities for future research.
