Large-scale vision-language pre-training has achieved strong performance on multi-modal understanding and generation tasks. However, existing methods often perform poorly on image-text matching tasks that require structured representations, i.e., representations of objects, attributes, and relations. As illustrated in Fig. 1(a), such models cannot distinguish between "An astronaut rides a horse" and "A horse rides an astronaut". This is because they fail to fully leverage structured knowledge when learning representations in multi-modal scenarios. In this paper, we present Structure-CLIP, an end-to-end framework that integrates Scene Graph Knowledge (SGK) to enhance multi-modal structured representations. First, we use scene graphs to guide the construction of semantic negative examples, which places greater emphasis on learning structured representations. Second, we propose a Knowledge-Enhance Encoder (KEE) that takes SGK as input to further enhance structured representations. To verify the effectiveness of the proposed framework, we pre-train our model with these two approaches and conduct experiments on downstream tasks. Experimental results demonstrate that Structure-CLIP achieves state-of-the-art (SOTA) performance on the VG-Attribution and VG-Relation datasets, outperforming the multi-modal SOTA model by 12.5% and 4.1%, respectively. Meanwhile, the results on MSCOCO indicate that Structure-CLIP significantly enhances structured representations while maintaining the ability to learn general representations. Our code is available at https://github.com/zjukg/Structure-CLIP.
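To make the idea of scene-graph-guided negative examples concrete, the following is a minimal sketch, not the released implementation. It assumes a caption has already been parsed into a (subject, relation, object) triple and builds a hard negative by swapping subject and object, so the negative caption keeps the same words but changes the structure. The function name `swap_negative` and the triple format are hypothetical choices made for illustration.

```python
from typing import Tuple

# Hypothetical triple format: (subject, relation, object), with articles
# kept inside the subject and object phrases so the swap stays grammatical.
Triple = Tuple[str, str, str]


def swap_negative(caption: str, triple: Triple) -> str:
    """Build a structural negative by swapping subject and object.

    Illustrative assumption: the triple appears in the caption as the
    contiguous phrase "<subject> <relation> <object>".
    """
    subject, relation, obj = triple
    positive = f"{subject} {relation} {obj}"
    if positive not in caption:
        raise ValueError("triple does not appear contiguously in caption")
    negative = f"{obj} {relation} {subject}"
    # Replace only the first occurrence so the rest of the caption is untouched.
    return caption.replace(positive, negative, 1)


if __name__ == "__main__":
    caption = "an astronaut rides a horse"
    triple = ("an astronaut", "rides", "a horse")
    print(swap_negative(caption, triple))
```

The original caption and its swapped version contain exactly the same words, so a model that treats the caption as a bag of words cannot tell them apart. Pairs of this kind are what push a contrastive objective toward structure-sensitive representations.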