Dynamic scene graph generation (SGG) aims to detect objects in a video and determine their pairwise relationships. Existing dynamic SGG methods usually suffer from several issues, including 1) contextual noise, as some frames contain occluded or blurred objects; and 2) label bias, which stems primarily from the severe imbalance between a few positive relationship samples and numerous negative ones, and is compounded by the long-tailed distribution of relationships. To address these problems, we introduce TD$^2$-Net, a network that performs denoising and debiasing for dynamic SGG. Specifically, we first propose a denoising spatio-temporal transformer module that enhances object representations with robust contextual information. This is achieved by a differentiable Top-K object selector that uses the Gumbel-Softmax sampling strategy to select the relevant neighborhood for each object. Second, we introduce an asymmetrical reweighting loss to relieve label bias. This loss function combines asymmetric focusing factors with the volume of samples to adjust the weights assigned to individual samples. Systematic experiments demonstrate the superiority of TD$^2$-Net over existing state-of-the-art approaches on the Action Genome dataset. In particular, TD$^2$-Net outperforms the second-best competitor by 12.7\% on mean-Recall@10 for predicate classification.
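To make the idea of a Gumbel-perturbed, relaxed Top-K selection more concrete, the following is a minimal sketch and not the selector used in TD$^2$-Net. It assumes a common relaxation in which $K$ successive temperature-scaled softmaxes are applied to Gumbel-perturbed scores, with already-selected entries softly suppressed between rounds. The names `relaxed_top_k`, `tau`, and the example scores are hypothetical. The code only shows the forward computation in plain Python; in an autograd framework the same operations would be differentiable with respect to the scores.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid log(0)
    return -math.log(-math.log(u))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relaxed_top_k(scores, k, tau=0.5, rng=None):
    """Soft selection weights for k of the candidates (assumed relaxation).

    scores: relevance score of each candidate neighbor (hypothetical input).
    Returns one weight per candidate; the weights sum to k.
    """
    rng = rng or random.Random()
    # Perturb the scores so that selection is stochastic.
    perturbed = [s + gumbel_noise(rng) for s in scores]
    weights = [0.0] * len(scores)
    for _ in range(k):
        probs = softmax([p / tau for p in perturbed])
        weights = [w + p for w, p in zip(weights, probs)]
        # Softly suppress what was just selected by adding log(1 - p).
        perturbed = [
            x + math.log(max(1.0 - p, 1e-300))
            for x, p in zip(perturbed, probs)
        ]
    return weights


if __name__ == "__main__":
    # Two candidates score far above the rest, so they receive
    # almost all of the selection mass regardless of the noise.
    w = relaxed_top_k([100.0, 0.0, 90.0, -5.0], k=2, rng=random.Random(0))
    print([round(x, 3) for x in w])
```

The resulting weights could then be used to mask or reweight attention over neighboring objects, so that low-relevance (for example occluded or blurred) candidates contribute little context. How the paper actually wires the selector into its transformer is not specified in the abstract.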
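The asymmetrical reweighting loss is described in the abstract only at a high level, so the following is an illustrative sketch under stated assumptions rather than the paper's formula. It assumes a focal-style binary cross-entropy with separate focusing exponents for positive and negative labels (`gamma_pos`, `gamma_neg`) and inverse-frequency class weights applied to the positive terms. The function name, the default values, and the choice of inverse-frequency weighting are all hypothetical.

```python
import math


def asymmetric_reweighted_bce(probs, targets, class_counts,
                              gamma_pos=0.0, gamma_neg=4.0, eps=1e-8):
    """Multi-label loss for one sample (assumed form, not the paper's).

    probs[c]        predicted probability of relationship class c
    targets[c]      1 if class c is a positive label, else 0
    class_counts[c] number of training samples of class c
    """
    # Inverse-frequency class weights, normalized to have mean 1,
    # so rare (tail) classes weigh more than frequent (head) ones.
    inv = [1.0 / max(n, 1) for n in class_counts]
    mean_inv = sum(inv) / len(inv)
    class_w = [v / mean_inv for v in inv]

    loss = 0.0
    for p, y, w in zip(probs, targets, class_w):
        if y == 1:
            # Positive term: mild focusing, scaled by the class weight.
            loss += -w * (1.0 - p) ** gamma_pos * math.log(max(p, eps))
        else:
            # Negative term: strong focusing suppresses the many
            # easy negatives (small p) that dominate the label set.
            loss += -(p ** gamma_neg) * math.log(max(1.0 - p, eps))
    return loss


if __name__ == "__main__":
    counts = [1000, 10]  # hypothetical head and tail class sizes
    head = asymmetric_reweighted_bce([0.3, 0.3], [1, 0], counts)
    tail = asymmetric_reweighted_bce([0.3, 0.3], [0, 1], counts)
    print(head < tail)  # a missed tail-class positive costs more
```

Using a larger exponent for negatives than for positives addresses the positive/negative imbalance, while the count-based weights address the long-tailed class distribution; these are the two sources of label bias the abstract names.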