Composing Object Relations and Attributes for Image-Text Matching

We study the visual semantic embedding problem for image-text matching. Most existing work utilizes a tailored cross-attention mechanism to perform local alignment across the two image and text modalities. This is computationally expensive, even though it is more powerful than the unimodal dual-encoder approach. This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges. Utilizing a graph attention network, our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system. Representing caption as a scene graph offers the ability to utilize the strong relational inductive bias of graph neural networks to learn object-attribute and object-object relations effectively. To train the model, we propose losses that align the image and caption both at the holistic level (image-caption) and the local level (image-object entity), which we show is key to the success of the model. Our model is termed Composition model for Object Relations and Attributes, CORA. Experimental results on two prominent image-text retrieval benchmarks, Flickr30K and MSCOCO, demonstrate that CORA outperforms existing state-of-the-art computationally expensive cross-attention methods regarding recall score while achieving fast computation speed of the dual encoder.

翻译：本研究针对图像-文本匹配中的视觉语义嵌入问题展开探讨。现有研究大多采用定制的交叉注意力机制来实现图像与文本两种模态间的局部对齐，这种方法虽较单模态双编码器方案更为强大，但计算成本高昂。本文提出一种双编码器图像-文本匹配模型，通过场景图表示文本描述：以节点表示物体及其属性，并通过关系边相互连接。利用图注意力网络，我们的模型能高效编码物体-属性与物体-物体间的语义关系，从而构建出鲁棒且快速的系统。将文本描述表示为场景图，能够充分发挥图神经网络强大的关系归纳偏置优势，有效学习物体-属性及物体-物体间的关系。为训练模型，我们提出了在整体层面（图像-文本描述）和局部层面（图像-物体实体）同时对齐图像与文本描述的损失函数，并证明这是模型成功的关键。我们的模型被命名为物体关系与属性组合模型（CORA）。在Flickr30K和MSCOCO两个主流图像-文本检索基准数据集上的实验结果表明，CORA在召回率指标上超越了现有计算昂贵的交叉注意力方法，同时保持了双编码器的高速计算特性。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

32+阅读 · 2021年9月29日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日