Scene graph generation (SGG) aims to detect visual relationship triplets of the form (subject, predicate, object). Because of biases in the data, current models tend to predict common predicates, e.g., "on" and "at", instead of informative ones, e.g., "standing on" and "looking at". This tendency loses precise information and degrades overall performance. If a model describes an image only as "stone on road" rather than "stone blocking road", the scene may be gravely misunderstood. We argue that this phenomenon is caused by two imbalances: one at the semantic space level and one at the training sample level. To address this problem, we propose DB-SGG, an effective framework based on debiasing rather than conventional distribution fitting. It integrates two components, Semantic Debiasing (SD) and Balanced Predicate Learning (BPL), which target the two imbalances respectively. SD uses a confusion matrix and a bipartite graph to construct relationships between predicates. BPL adopts a random undersampling strategy and an ambiguity-removing strategy to focus on informative predicates. Because the process is model-agnostic, our method can be easily applied to SGG models; it outperforms Transformer by 136.3%, 119.5%, and 122.6% on mR@20 on the three SGG sub-tasks of the SGG-VG dataset. We further verify our method on another complex SGG dataset (SGG-GQA) and on two downstream tasks (sentence-to-graph retrieval and image captioning).
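To make the random undersampling idea behind BPL concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not say how the sampling is configured, so the function name `undersample_by_predicate`, the per-predicate `cap`, and the toy triplets are all assumptions chosen for illustration. The sketch only shows the general technique of capping frequent predicates so that rare, informative ones carry more weight during training.

```python
import random
from collections import Counter, defaultdict


def undersample_by_predicate(triplets, cap, seed=0):
    """Keep at most `cap` randomly chosen triplets per predicate.

    Hypothetical helper: `cap` and `seed` are illustrative parameters,
    not values taken from the paper.
    """
    rng = random.Random(seed)

    # Group (subject, predicate, object) triplets by their predicate.
    by_predicate = defaultdict(list)
    for triplet in triplets:
        by_predicate[triplet[1]].append(triplet)

    balanced = []
    for group in by_predicate.values():
        # Only predicates with more than `cap` samples are cut down.
        if len(group) > cap:
            group = rng.sample(group, cap)
        balanced.extend(group)
    return balanced


# Toy data (made up): "on" dominates, "blocking" is rare.
triplets = (
    [("stone", "on", "road")] * 8
    + [("man", "standing on", "road")] * 2
    + [("stone", "blocking", "road")] * 1
)

balanced = undersample_by_predicate(triplets, cap=2)
counts = Counter(predicate for _, predicate, _ in balanced)
print(dict(counts))
```

With a cap of 2, the common predicate "on" is reduced from 8 samples to 2, while "standing on" and "blocking" are left untouched. A per-predicate cap is only one possible design; the paper's actual sampling rule, and its ambiguity-removing strategy, may differ.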