Scene graph generation (SGG) is an important task in image understanding: it represents the relationships between objects in an image as a graph structure, making the semantic relationships between objects intuitively accessible. Previous SGG studies used message-passing neural networks (MPNNs) to update features, which effectively incorporate information from surrounding objects. However, these studies failed to reflect the co-occurrence of objects during scene graph generation. In addition, they addressed the long-tail problem of the training dataset only from the perspectives of sampling and learning methods. To address these two problems, we propose CooK, which reflects the Co-occurrence Knowledge between objects, together with a learnable term frequency-inverse document frequency (TF-l-IDF) to mitigate the long-tail problem. Applied to the SGG benchmark dataset, the proposed model improves performance by up to 3.8% over existing state-of-the-art models on the SGGen subtask. The results further demonstrate the generalization ability of the proposed method, which yields consistent performance improvements across all MPNN-based models.
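The abstract does not specify how the co-occurrence knowledge or the learnable TF-l-IDF are computed. As a rough, non-authoritative illustration of the two underlying ideas only (counting object co-occurrence across images, and classic TF-IDF re-weighting that boosts rare tail predicates), here is a minimal sketch on hypothetical toy annotations; the function names and data are illustrative, not the paper's actual pipeline:

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_matrix(image_objects):
    """Count how often each unordered pair of object classes
    appears together in the same image."""
    counts = Counter()
    for objs in image_objects:
        for a, b in combinations(sorted(set(objs)), 2):
            counts[(a, b)] += 1
    return counts

def tf_idf_weights(predicate_lists):
    """Classic (non-learnable) TF-IDF over predicate classes:
    tf  = corpus-wide frequency of a predicate,
    idf = log(N / number of images containing it).
    Rare tail predicates receive a larger idf, which is the
    intuition behind re-weighting a long-tailed distribution."""
    n_images = len(predicate_lists)
    tf, df = Counter(), Counter()
    for preds in predicate_lists:
        tf.update(preds)
        df.update(set(preds))
    total = sum(tf.values())
    return {p: (tf[p] / total) * math.log(n_images / df[p]) for p in tf}

# Toy annotations: object classes and predicate labels per image.
images = [["person", "dog", "leash"], ["person", "bike"], ["dog", "ball"]]
preds = [["holding", "on"], ["riding"], ["playing_with"]]

cook_counts = cooccurrence_matrix(images)
weights = tf_idf_weights(preds)
```

In the paper the IDF component is made learnable rather than fixed by this closed-form log, so the sketch above only conveys the static baseline it generalizes.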