Scene Graph Generation (SGG) plays a pivotal role in downstream vision-language tasks. Existing SGG methods typically generalize poorly to unseen triplet compositions. They are generally trained on incompletely annotated scene graphs in which a few triplets dominate, and they are therefore biased toward these seen triplets during inference. To address this issue, we propose a Triplet Calibration and Reduction (T-CAR) framework. In our framework, we first introduce a triplet calibration loss that regularizes the representations of diverse triplets and simultaneously mines the unseen triplets in incompletely annotated training scene graphs. Moreover, the unseen space of scene graphs is usually several times larger than the seen space, since it contains a huge number of unrealistic compositions. We therefore propose an unseen space reduction loss that shifts the focus of this mining toward reasonable unseen compositions, which facilitates model training. Finally, we propose a contextual encoder that improves compositional generalization to unseen triplets by explicitly modeling the relative spatial relations between subjects and objects. Extensive experiments show that our approach achieves consistent improvements for zero-shot SGG over state-of-the-art methods. The code is available at https://github.com/jkli1998/T-CAR.
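To make the idea of "relative spatial relations between subjects and objects" concrete, the following is a minimal sketch of one common way to encode the geometry of a subject-object box pair. It is an illustration only: the function name `relative_spatial_features`, the `(x1, y1, x2, y2)` box format, and the choice of features (normalized center offsets, log size ratios, and IoU) are assumptions made here, not the contextual encoder described in the abstract or implemented in the T-CAR repository.

```python
import math


def relative_spatial_features(subj_box, obj_box):
    """Encode the object's box relative to the subject's box.

    Hypothetical helper: a generic box-delta encoding, not the paper's encoder.
    Boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
    Returns [dx, dy, dw, dh, iou].
    """
    sx1, sy1, sx2, sy2 = subj_box
    ox1, oy1, ox2, oy2 = obj_box

    # Widths, heights, and centers of both boxes.
    sw, sh = sx2 - sx1, sy2 - sy1
    ow, oh = ox2 - ox1, oy2 - oy1
    scx, scy = sx1 + sw / 2, sy1 + sh / 2
    ocx, ocy = ox1 + ow / 2, oy1 + oh / 2

    # Center offset of the object, normalized by the subject's size,
    # so the feature does not depend on absolute image position or scale.
    dx = (ocx - scx) / sw
    dy = (ocy - scy) / sh

    # Log ratios of object size to subject size.
    dw = math.log(ow / sw)
    dh = math.log(oh / sh)

    # Intersection over union of the two boxes.
    inter_w = max(0.0, min(sx2, ox2) - max(sx1, ox1))
    inter_h = max(0.0, min(sy2, oy2) - max(sy1, oy1))
    inter = inter_w * inter_h
    union = sw * sh + ow * oh - inter
    iou = inter / union

    return [dx, dy, dw, dh, iou]


# Example: an object of the same size, shifted half a box width to the right.
features = relative_spatial_features((0, 0, 10, 10), (5, 0, 15, 10))
```

Because every feature is computed relative to the subject box, the same vector is produced for a given spatial layout wherever the pair appears in the image. That independence from the particular object categories is one plausible reason geometric cues of this kind could help on triplets never seen during training.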