Scene Graph Generation (SGG) remains a challenging task due to its compositional nature. Previous approaches improve prediction efficiency through end-to-end learning. However, these methods exhibit limited performance, as they assume unidirectional conditioning between entities and predicates, which leads to insufficient information interaction. To address this limitation, we propose a novel bidirectional conditioning factorization for SGG that introduces efficient interaction between entities and predicates. Specifically, we develop an end-to-end scene graph generation model, the Bidirectional Conditioning Transformer (BCTR), to implement our factorization. BCTR consists of two key modules. First, the Bidirectional Conditioning Generator (BCG) performs multi-stage interactive feature augmentation between entities and predicates, enabling the two predictions to benefit from each other. Second, Random Feature Alignment (RFA) regularizes the feature space by distilling multi-modal knowledge from pre-trained models, enhancing BCTR's performance on tail categories without relying on statistical priors. We conduct extensive experiments on Visual Genome and Open Image V6, demonstrating that BCTR achieves state-of-the-art performance on both benchmarks. The code will be released upon acceptance of the paper.