Current video-based scene graph generation (VidSGG) methods perform poorly on under-represented predicates because of the inherently biased (long-tailed) distribution of the training data. In this paper, we take a closer look at the predicates and find that most visual relations (e.g., sit_above) involve both an actional pattern (sit) and a spatial pattern (above), and that the distribution bias is much less severe at the pattern level. Based on this insight, we propose a decoupled label learning (DLL) paradigm that addresses the otherwise intractable visual relation prediction problem from a pattern-level perspective. Specifically, DLL decouples the predicate labels and uses separate classifiers to learn the actional and spatial patterns, respectively. The predicted patterns are then combined and mapped back to the predicate. Moreover, we propose a knowledge-level label decoupling method that transfers non-target knowledge from head predicates to tail predicates within the same pattern in order to calibrate the distribution of tail classes. We validate the effectiveness of DLL on the commonly used VidSGG benchmark VidVRD. Extensive experiments demonstrate that DLL offers a remarkably simple but highly effective solution to the long-tailed problem and achieves state-of-the-art VidSGG performance.
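The label decoupling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy predicate vocabulary in which every label has the form `<action>_<spatial>`, and it assumes that the two pattern classifiers are combined by multiplying their softmax probabilities, which the abstract does not specify. The names `PREDICATES`, `decouple`, and `predict_predicate` are hypothetical.

```python
import math

# Hypothetical toy vocabulary; every entry is assumed to be "<action>_<spatial>".
PREDICATES = ["sit_above", "sit_behind", "walk_behind", "walk_front", "stand_front"]


def decouple(predicate):
    """Split a predicate label into its (actional, spatial) patterns."""
    action, spatial = predicate.split("_", 1)
    return action, spatial


# Pattern-level label spaces derived from the predicate vocabulary.
ACTIONS = sorted({decouple(p)[0] for p in PREDICATES})    # ['sit', 'stand', 'walk']
SPATIALS = sorted({decouple(p)[1] for p in PREDICATES})   # ['above', 'behind', 'front']


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_predicate(action_logits, spatial_logits):
    """Combine two pattern classifiers and map the result back to a predicate.

    Assumption: the predicate score is the product of the two pattern
    probabilities, restricted to predicates that exist in the vocabulary.
    """
    p_action = dict(zip(ACTIONS, softmax(action_logits)))
    p_spatial = dict(zip(SPATIALS, softmax(spatial_logits)))
    scores = {}
    for predicate in PREDICATES:
        action, spatial = decouple(predicate)
        scores[predicate] = p_action[action] * p_spatial[spatial]
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Logits would normally come from the two separate classifiers.
    best, _ = predict_predicate([2.0, 0.0, 0.0], [2.0, 0.0, 0.0])
    print(best)
```

Because each classifier only has to separate a handful of patterns, a rare predicate such as a tail `<action>_<spatial>` combination can still receive a high score when its two patterns are individually well represented, which is the intuition the abstract gives for the pattern-level view.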