Learning similarity between scene graphs and images aims to estimate a similarity score given a scene graph and an image. No prior work is dedicated to this task, although it is critical for scene graph generation and downstream applications. Scene graph generation is conventionally evaluated by Recall$@K$ and mean Recall$@K$, which measure the fraction of human-labeled triplets that are recovered among the top-$K$ predicted triplets (averaged per predicate class for mean Recall$@K$). However, such triplet-oriented metrics fail to capture the overall semantic difference between a scene graph and an image, and they are sensitive to annotation bias and noise. The use of generated scene graphs in downstream applications is therefore limited. To address this issue, we propose, for the first time, a Scene graPh-imAge coNtrastive learning framework, SPAN, that measures the similarity between scene graphs and images. The framework consists of a graph Transformer and an image Transformer that align scene graphs and their corresponding images in a shared latent space. We introduce a graph serialization technique that transforms a scene graph into a sequence with structural encodings. Based on this framework, we propose R-Precision, which measures image retrieval accuracy, as a new evaluation metric for scene graph generation. We establish new benchmarks on the Visual Genome and Open Images datasets. Extensive experiments verify the effectiveness of SPAN and show its potential as a scene graph encoder.
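To make the retrieval-based metric concrete, the following is a minimal sketch of how R-Precision could be computed once scene graphs and images have been embedded in a shared space. It is not the SPAN implementation: the function names `cosine_similarity_matrix` and `r_precision`, the use of cosine similarity as the score, and the toy embeddings are all assumptions made for illustration. The graph and image encoders are taken as given.

```python
import numpy as np


def cosine_similarity_matrix(graph_emb, image_emb):
    """Return a (num_graphs, num_images) matrix of cosine similarities.

    Assumption: both inputs are 2-D arrays of non-zero embedding vectors
    that already live in the same latent space.
    """
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    return g @ v.T


def r_precision(sim, relevant):
    """Mean R-Precision over all query scene graphs.

    sim:      (num_graphs, num_images) similarity matrix.
    relevant: one set of relevant image indices per scene graph.
    For a query with R relevant images, R-Precision is the fraction of
    those images that appear among the top-R retrieved images.
    """
    scores = []
    for i, rel in enumerate(relevant):
        r = len(rel)
        if r == 0:
            continue  # skip queries without any relevant image
        top_r = np.argsort(-sim[i], kind="stable")[:r]
        hits = len(rel.intersection(top_r.tolist()))
        scores.append(hits / r)
    return float(np.mean(scores)) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical toy data: 4 scene graphs and 4 images, 8-dim embeddings.
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(4, 8))
    # Each graph embedding is a noisy copy of its paired image embedding.
    graph_emb = image_emb + 0.1 * rng.normal(size=(4, 8))
    sim = cosine_similarity_matrix(graph_emb, image_emb)
    print(r_precision(sim, [{0}, {1}, {2}, {3}]))
```

With exactly one matching image per scene graph, as in the toy data, R equals 1 and the metric reduces to top-1 retrieval accuracy. Unlike a triplet-matching metric, the score depends only on how the whole graph ranks against candidate images.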