Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation

Scene Graph Generation (SGG) aims to represent the objects in an image and their relationships in a structured and comprehensive way, which can significantly benefit scene understanding and related downstream tasks. Existing SGG models mostly focus on the long-tailed problem caused by biased datasets. However, even when these models fit a specific dataset better, they may still struggle with unseen triples, i.e., triples that do not appear in the training set. Most methods take a whole triple as input and learn its overall features through statistical machine learning. Such models have difficulty predicting unseen triples because the objects and predicates seen during training are recombined into novel triples in the test set. In this work, we propose a Text-Image-joint Scene Graph Generation (TISGG) model to handle unseen triples and improve the generalization capability of SGG models. We propose a Joint Feature Learning (JFL) module and a Factual Knowledge based Refinement (FKR) module that learn object and predicate categories separately at the feature level and align them with the corresponding visual features, so that the model is no longer limited to triple matching. In addition, since we observe that the long-tailed problem also affects generalization ability, we design a novel balanced learning strategy, consisting of a Character Guided Sampling (CGS) module and an Informative Re-weighting (IR) module, that provides a tailor-made learning method for each predicate according to its characteristics. Extensive experiments show that our model achieves state-of-the-art performance. In particular, TISGG improves zR@20 (zero-shot recall) by 11.7% on the PredCls sub-task of the Visual Genome dataset.
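To make the zR@K metric concrete, the following is a minimal sketch of zero-shot recall for a single image, not the evaluation code used in this work. It assumes that a triple counts as unseen when its exact (subject, predicate, object) category combination never occurs in the training set, and that a prediction matches a ground-truth triple on category labels alone; standard SGG evaluation also matches bounding boxes and averages over images, which is omitted here. The function name `zero_shot_recall_at_k` and the toy labels are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

# A triple of category labels: (subject, predicate, object).
Triple = Tuple[str, str, str]


def zero_shot_recall_at_k(
    train_triples: Iterable[Triple],
    gt_triples: List[Triple],
    ranked_predictions: List[Triple],
    k: int = 20,
) -> Optional[float]:
    """Fraction of unseen ground-truth triples found in the top-k predictions.

    Assumption: matching is done on category labels only (no box matching).
    Returns None when the image has no unseen ground-truth triples.
    """
    seen = set(train_triples)
    # Keep only ground-truth triples whose combination never occurred in training.
    unseen_gt = {t for t in gt_triples if t not in seen}
    if not unseen_gt:
        return None
    # Predictions are assumed to be sorted by confidence, highest first.
    top_k = set(ranked_predictions[:k])
    return len(unseen_gt & top_k) / len(unseen_gt)


# Toy example with hypothetical labels.
train = [("man", "riding", "horse"), ("dog", "on", "grass")]
gt = [("man", "riding", "horse"), ("dog", "riding", "horse"), ("man", "on", "grass")]
preds = [("dog", "riding", "horse"), ("man", "riding", "horse"), ("man", "near", "grass")]
score = zero_shot_recall_at_k(train, gt, preds, k=20)
```

In the toy example, the objects and predicates of both unseen ground-truth triples all occur in training, but only in other combinations, which is the situation the abstract describes.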
