In Defense of Clip-based Video Relation Detection

Video Visual Relation Detection (VidVRD) aims to detect visual relationship triplets in videos, localizing each with spatial bounding boxes and temporal boundaries. Existing VidVRD methods can be broadly categorized into bottom-up and top-down paradigms according to how they classify relations. Bottom-up methods follow a clip-based approach: they classify relations between pairs of short clip tubelets and then merge them into long video relations. Top-down methods, in contrast, directly classify pairs of long video tubelets. While recent video-based methods built on video tubelets have shown promising results, we argue that effective modeling of spatial and temporal context matters more than the choice between clip tubelets and video tubelets. This motivates us to revisit the clip-based paradigm and explore the key success factors in VidVRD. In this paper, we propose a Hierarchical Context Model (HCM) that enriches object-based spatial context and relation-based temporal context based on clips. We demonstrate that clip tubelets can achieve superior performance to most video-based methods. Clip tubelets also offer more flexibility in model design and help alleviate the limitations of video tubelets, such as the challenging long-term object tracking problem and the loss of temporal information when long-term tubelet features are compressed. Extensive experiments on two challenging VidVRD benchmarks show that HCM achieves new state-of-the-art performance, highlighting the effectiveness of advanced spatial and temporal context modeling within the clip-based paradigm.
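To make the bottom-up paradigm concrete, the merging step can be sketched as a simple greedy association: clip-level predictions that share the same subject, predicate, and object and that touch or overlap in time are fused into one video-level relation. This is only an illustrative sketch, not the HCM model or the merging procedure used in the paper, which the abstract does not specify. The `ClipRelation` type, the `merge_clip_relations` function, the `max_gap` parameter, the score averaging, and the assumption that tubelets carry track ids linking them across clips are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClipRelation:
    # Hypothetical record for one predicted relation; track ids are assumed
    # to identify the same object across neighboring clips.
    subject_id: int
    predicate: str
    object_id: int
    start: int    # first frame, inclusive
    end: int      # last frame, exclusive
    score: float  # classifier confidence


def merge_clip_relations(clip_relations, max_gap=0):
    """Greedily fuse clip-level relations into video-level relations.

    Relations with the same (subject, predicate, object) are merged when the
    next one starts no later than `max_gap` frames after the current end.
    The merged score is the mean of the clip scores (an assumed choice).
    """
    # Group predictions by triplet identity.
    groups = {}
    for rel in clip_relations:
        key = (rel.subject_id, rel.predicate, rel.object_id)
        groups.setdefault(key, []).append(rel)

    merged = []
    for (subj, pred, obj), rels in groups.items():
        rels.sort(key=lambda r: r.start)
        start, end, scores = rels[0].start, rels[0].end, [rels[0].score]
        for rel in rels[1:]:
            if rel.start <= end + max_gap:
                # Temporally adjacent or overlapping: extend the segment.
                end = max(end, rel.end)
                scores.append(rel.score)
            else:
                # Gap too large: close the segment and open a new one.
                merged.append(ClipRelation(subj, pred, obj, start, end,
                                           sum(scores) / len(scores)))
                start, end, scores = rel.start, rel.end, [rel.score]
        merged.append(ClipRelation(subj, pred, obj, start, end,
                                   sum(scores) / len(scores)))

    # Rank video-level relations by confidence, highest first.
    return sorted(merged, key=lambda r: r.score, reverse=True)


if __name__ == "__main__":
    clips = [
        ClipRelation(0, "chase", 1, 0, 30, 0.5),
        ClipRelation(0, "chase", 1, 30, 60, 0.75),
        ClipRelation(0, "chase", 1, 90, 120, 0.875),
        ClipRelation(2, "ride", 3, 0, 30, 0.25),
    ]
    for rel in merge_clip_relations(clips):
        print(rel)
```

In this toy input, the two adjacent "chase" clips fuse into a single relation spanning frames 0 to 60, while the later "chase" clip stays separate because of the 30-frame gap. A top-down method would skip this step and classify the long tubelet pair directly, which is where long-term tracking and feature compression become the limiting factors.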
