Dynamic scene graphs generated from video clips can enhance semantic visual understanding in a wide range of challenging tasks, such as environmental perception, autonomous navigation, and task planning for self-driving vehicles and mobile robots. Dynamic scene graph generation requires both temporal and spatial modeling, and learning relations that vary from frame to frame is particularly difficult. In this paper, we propose a Time-variant Relation-aware TRansformer (TR$^2$), which aims to model the temporal change of relations in dynamic scene graphs. Explicitly, we use the difference between text embeddings of prompted sentences about relation labels as the supervision signal for relations, which provides cross-modality feature guidance for learning time-variant relations. Implicitly, we design a transformer-based relation feature fusion module with an additional message token that describes the difference between adjacent frames. Extensive experiments on the Action Genome dataset show that TR$^2$ effectively models time-variant relations. TR$^2$ significantly outperforms previous state-of-the-art methods under two different settings, by 2.1% and 2.6% respectively.
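To make the explicit supervision idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that visual relation features and text embeddings of the prompted relation sentences have already been projected into a shared space of equal dimension, and it uses a mean squared error between the two frame-to-frame differences as a stand-in loss; the abstract does not state the actual loss. The function names `vector_difference` and `difference_alignment_loss` and the toy vectors are hypothetical.

```python
from typing import List

Vector = List[float]


def vector_difference(later: Vector, earlier: Vector) -> Vector:
    """Element-wise difference between two equal-length vectors."""
    if len(later) != len(earlier):
        raise ValueError("vectors must have the same dimension")
    return [a - b for a, b in zip(later, earlier)]


def difference_alignment_loss(
    visual_prev: Vector,
    visual_next: Vector,
    text_prev: Vector,
    text_next: Vector,
) -> float:
    """Mean squared error between the visual and text feature differences.

    Assumption: all four vectors live in one shared embedding space.
    The text difference serves as the supervision target for how the
    visual relation feature should change between adjacent frames.
    """
    visual_delta = vector_difference(visual_next, visual_prev)
    text_delta = vector_difference(text_next, text_prev)
    squared_errors = [(v - t) ** 2 for v, t in zip(visual_delta, text_delta)]
    return sum(squared_errors) / len(squared_errors)


# Toy example: the relation label changes between adjacent frames, so the
# text embedding moves from one axis to the other (made-up vectors).
text_prev, text_next = [1.0, 0.0], [0.0, 1.0]
visual_prev, visual_next = [0.5, 0.0], [0.0, 0.5]
loss = difference_alignment_loss(visual_prev, visual_next, text_prev, text_next)
```

When the relation label does not change between two frames, the text difference is the zero vector, so this stand-in loss pushes the visual relation feature to stay constant as well.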