IMAGINATOR: Pre-Trained Image+Text Joint Embeddings using Word-Level Grounding of Images

Word embeddings, i.e., semantically meaningful vector representations of words, are largely influenced by the distributional hypothesis "You shall know a word by the company it keeps" (Harris, 1954), whereas modern prediction-based neural network embeddings rely on design choices and hyperparameter optimization. Word embeddings such as Word2Vec and GloVe capture contextuality and real-world analogies well, but contemporary convolution-based image embeddings such as VGGNet and AlexNet do not capture contextual knowledge. The popular king-queen analogy does not hold for most commonly used vision embeddings. In this paper, we introduce a pre-trained joint embedding (JE), named IMAGINATOR, trained at the object level on 21K distinct image objects from 1M image+text pairs. A JE is a way to encode multimodal data into a vector space in which the text modality serves as the grounding key to which the complementary modality (in this case, the image) is anchored. IMAGINATOR encapsulates three individual representations: (i) object-object co-location, (ii) word-object co-location, and (iii) word-object correlation. These three representations capture complementary aspects of the two modalities and are combined to obtain the final JEs. The generated JEs are intrinsically evaluated to assess how well they capture contextuality and real-world analogies. We also evaluate pre-trained IMAGINATOR JEs on three downstream tasks: (i) image captioning, (ii) Image2Tweet, and (iii) text-based image retrieval. IMAGINATOR establishes a new standard on these downstream tasks by outperforming the current SoTA on all of them. IMAGINATOR will be made publicly available. The code is available at https://github.com/varunakk/IMAGINATOR
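To make the notion of object-object co-location concrete, the following is a minimal sketch of how pairwise co-occurrence counts could be gathered from per-image object labels. It is not the authors' implementation: the function name `colocation_counts`, the toy label lists, and the choice to count each unordered pair once per image are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def colocation_counts(images):
    """Count how often each unordered pair of object labels shares an image.

    `images` is an iterable of label lists, one list per image (hypothetical input).
    """
    counts = Counter()
    for objects in images:
        # Deduplicate and sort so each pair is counted once per image,
        # always under the same (alphabetical) key order.
        for a, b in combinations(sorted(set(objects)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical detections: one list of object labels per image.
toy_images = [
    ["person", "dog", "ball"],
    ["person", "dog"],
    ["person", "car"],
]
counts = colocation_counts(toy_images)
```

Counts of this kind could then be normalized or factorized into dense vectors. How IMAGINATOR actually derives and combines its three representations is not specified in the abstract.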


Related content

Word vector representations


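A distributed representation expresses language as dense, low-dimensional, continuous vectors. Researchers found early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. These methods can all be trained directly on large-scale unlabeled corpora. The quality of word embeddings also depends heavily on the choice of context window size. In general, embeddings learned with a large context window reflect more topical information, whereas embeddings learned with a small context window reflect more of a word's function and its contextual semantics.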
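The analogy relation man − woman ≈ king − queen can be checked with simple vector arithmetic. The sketch below uses hand-made three-dimensional toy vectors, not trained embeddings. The helper names `cosine` and `analogy` and the contents of `toy_vectors` are hypothetical, chosen only to illustrate the idea.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Answer "a is to b as c is to ?" by nearest cosine neighbour.

    The target is vec(b) - vec(a) + vec(c); the query words are excluded.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# Toy vectors (assumed for illustration): axis 1 ~ gender, axis 2 ~ royalty.
toy_vectors = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.5, 0.0],
}
result = analogy("man", "woman", "king", toy_vectors)
```

With these toy vectors the target lands exactly on the vector for "queen". With real embeddings the match is only approximate, which is why the nearest neighbour is taken rather than an exact match.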