Recent advances in Scene Graph Generation (SGG) typically model the relationships among entities using box-level features from pre-defined detectors. We argue that an overlooked problem in SGG is the coarse-grained interaction between boxes, which inadequately captures the contextual semantics needed for relationship modeling and in practice limits the development of the field. In this paper, we explore and propose a generic paradigm termed Superpixel-based Interaction Learning (SIL), which remedies coarse-grained interactions at the box level by modeling fine-grained interactions at the superpixel level. Specifically, (i) we treat a scene as a set of points and cluster them into superpixels representing sub-regions of the scene, and (ii) we explore intra-entity and cross-entity interactions among the superpixels to enrich fine-grained interactions between entities at an earlier stage. Extensive experiments on two challenging benchmarks (Visual Genome and Open Images V6) show that SIL enables fine-grained interaction at the superpixel level, beyond what previous box-level methods offer, and significantly outperforms previous state-of-the-art methods across all metrics. More encouragingly, the proposed method can be applied in a plug-and-play fashion to boost the performance of existing box-level approaches. In particular, on the PredCls task on Visual Genome, SIL improves baselines by 2.0% mR on average (and by up to 3.4%), which supports integrating it into any existing box-level method.
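The two steps named in the abstract, clustering scene points into superpixels and then letting superpixels interact within and across entities, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the clustering algorithm or the interaction module, so the plain k-means, the rule that assigns a superpixel to an entity when its center falls inside the box, the masked dot-product attention, and all function names (`cluster_superpixels`, `masked_attention`, `superpixel_interactions`) are assumptions made purely for illustration.

```python
import numpy as np


def cluster_superpixels(feats, n_clusters=4, n_iters=5):
    """Toy k-means over scene points (assumed stand-in for the paper's clustering).

    feats: (H, W, C) feature map. Each location becomes a point made of its
    C features plus its normalized (y, x) position.
    Returns labels of shape (H, W) and centers of shape (n_clusters, C + 2).
    """
    h, w, c = feats.shape
    ys, xs = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    points = np.concatenate(
        [feats.reshape(-1, c), ys.reshape(-1, 1), xs.reshape(-1, 1)], axis=1
    )
    # Deterministic initialization: evenly spaced points in raster order.
    init_idx = np.linspace(0, len(points) - 1, n_clusters).astype(int)
    centers = points[init_idx].copy()
    for _ in range(n_iters):
        # Assign every point to its nearest center, then recompute the centers.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for k in range(n_clusters):
            members = labels == k
            if members.any():
                centers[k] = points[members].mean(0)
    return labels.reshape(h, w), centers


def masked_attention(q, k, v, mask):
    """Dot-product attention restricted to pairs where mask is True.

    Rows with no allowed partner receive an all-zero output.
    """
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores = np.where(mask, scores, -1e9)
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores) * mask
    denom = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)
    return weights @ v


def superpixel_interactions(centers, boxes):
    """Intra-entity and cross-entity interactions among superpixels (assumed form).

    centers: (S, C + 2) superpixel centers from cluster_superpixels.
    boxes:   (E, 4) entity boxes as normalized (y0, x0, y1, x1).
    Returns updated superpixel tokens (S, C) and the membership matrix (S, E).
    """
    yx = centers[:, -2:]
    # Assumption: a superpixel belongs to an entity if its center lies in the box.
    inside = (
        (yx[:, None, 0] >= boxes[None, :, 0]) & (yx[:, None, 0] <= boxes[None, :, 2])
        & (yx[:, None, 1] >= boxes[None, :, 1]) & (yx[:, None, 1] <= boxes[None, :, 3])
    )
    # Intra-entity pairs share at least one entity.
    same_entity = (inside[:, None, :] & inside[None, :, :]).any(-1)
    # Cross-entity pairs are both covered by some entity but share none.
    covered = inside.any(1)
    cross_entity = covered[:, None] & covered[None, :] & ~same_entity
    tokens = centers[:, :-2]
    intra = masked_attention(tokens, tokens, tokens, same_entity)
    cross = masked_attention(tokens, tokens, tokens, cross_entity)
    # Residual update: each superpixel token is enriched by both interaction types.
    return tokens + intra + cross, inside
```

In a plug-and-play setting, tokens enriched this way would be pooled per entity and passed to an existing box-level relationship head. How SIL actually performs that fusion is not stated in the abstract, so that step is left out here.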