Large-scale vision-language pre-training has achieved strong performance on multi-modal understanding and generation tasks. However, existing methods often perform poorly on image-text matching tasks that require structured representations, i.e., representations of objects, attributes, and relations. For example, previous models cannot distinguish "An astronaut rides a horse" from "A horse rides an astronaut". This is because they fail to fully leverage structured knowledge when learning representations in multi-modal scenarios. In this paper, we present Structure-CLIP, an end-to-end framework that integrates Scene Graph Knowledge (SGK) to enhance multi-modal structured representations. First, we use scene graphs to guide the construction of semantic negative examples, which places greater emphasis on learning structured representations. Second, we propose a Knowledge-Enhance Encoder (KEE) that takes SGK as input to further enhance structured representations. To verify the effectiveness of the proposed framework, we pre-train our model with these approaches and conduct experiments on downstream tasks. Experimental results demonstrate that Structure-CLIP achieves state-of-the-art (SOTA) performance on the VG-Attribution and VG-Relation datasets, outperforming the multi-modal SOTA model by 12.5% and 4.1%, respectively. Meanwhile, the results on MSCOCO indicate that Structure-CLIP significantly enhances structured representations while maintaining its general representation ability. Our code will be available soon.
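As a rough illustration of the scene-graph-guided negative examples described above, the sketch below swaps the subject and object of a single scene-graph triple to turn a caption into a semantically different one. The `Triple` class and the `caption` and `swap_negative` helpers are hypothetical names introduced here for illustration; they are not taken from the Structure-CLIP code, and the actual framework may construct negatives differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One (subject, relation, object) edge of a scene graph (hypothetical type)."""
    subject: str
    relation: str
    obj: str


def caption(t: Triple) -> str:
    """Render a triple as a simple lower-case caption."""
    return f"{t.subject} {t.relation} {t.obj}"


def swap_negative(t: Triple) -> Triple:
    """Build a negative example by exchanging subject and object.

    The words stay the same; only the structure (who does what to whom) changes.
    """
    return Triple(subject=t.obj, relation=t.relation, obj=t.subject)


positive = Triple("an astronaut", "rides", "a horse")
negative = swap_negative(positive)

print(caption(positive))
print(caption(negative))
```

Because the positive and negative captions contain exactly the same words, a model can only tell them apart by representing the relation structure, which is the property such negatives are meant to stress.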