Video Semantic Role Labeling (VidSRL) aims to detect the salient events in given videos by recognizing predicate-argument event structures and the interrelationships between events. Although recent work has proposed methods for VidSRL, most suffer from two key drawbacks: a lack of fine-grained spatial scene perception and insufficient modeling of video temporality. To this end, this work explores a novel holistic spatio-temporal scene graph (namely HostSG) representation, based on existing dynamic scene graph structures, which models both the fine-grained spatial semantics and the temporal dynamics of videos for VidSRL. Built upon the HostSG, we present a niche-targeting VidSRL framework. A scene-event mapping mechanism is first designed to bridge the gap between the underlying scene structure and the high-level event semantic structure, resulting in an overall hierarchical scene-event (termed ICE) graph structure. We further perform iterative structure refinement to optimize the ICE graph, so that the overall structure representation best matches the demands of the end task. Finally, the three subtask predictions of VidSRL are jointly decoded, where the end-to-end paradigm effectively avoids error propagation. On the benchmark dataset, our framework significantly outperforms the current best-performing model. Further analyses are provided for a better understanding of the advances of our method.