SITTA: A Semantic Image-Text Alignment for Image Captioning

Textual and semantic comprehension of images is essential for generating proper captions. This comprehension requires detecting objects, modeling the relations between them, assessing the semantics of the scene, and, finally, representing the extracted knowledge in a language space. To achieve rich language capabilities while ensuring good image-language mappings, pretrained language models (LMs) have been conditioned on pretrained multi-modal (image-text) models that allow for image inputs. This requires aligning the image representation of the multi-modal model with the language representations of a generative LM. However, it is not clear how best to transfer the semantics detected by the vision encoder of the multi-modal model to the LM. We introduce two novel ways of constructing a linear mapping that successfully transfers semantics between the embedding spaces of the two pretrained models. The first aligns the embedding space of the multi-modal language encoder with the embedding space of the pretrained LM via token correspondences. The second leverages additional data consisting of image-text pairs to construct the mapping directly from vision to language space. Using our semantic mappings, we unlock image captioning for LMs without access to gradient information. By using different sources of data, we achieve strong captioning performance on the MS-COCO and Flickr30k datasets. Even in the face of limited data, our method partly exceeds the performance of other zero-shot and even finetuned competitors. Our ablation studies show that even LMs at a scale of merely 250M parameters can generate decent captions when using our semantic mappings. Our approach makes image captioning more accessible for institutions with restricted computational resources.
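
As an illustration of how a linear mapping between two embedding spaces can be constructed without gradient information, the following minimal sketch fits the mapping in closed form from paired embeddings. It rests on several assumptions that the abstract does not confirm: ordinary least squares is used as the fitting procedure, the paired rows stand in for token correspondences (or image-text pairs), and all data are synthetic. The names `fit_linear_map` and `nearest_tokens` are hypothetical helpers, not part of any released code.

```python
import numpy as np


def fit_linear_map(src, tgt):
    """Return W minimizing ||src @ W - tgt||_F (ordinary least squares).

    src: (n, d_src) embeddings from the multi-modal model (assumed paired).
    tgt: (n, d_tgt) embeddings of the same items in the LM space.
    """
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W


def nearest_tokens(query, lm_embeddings, k=3):
    """Indices of the k LM embeddings most cosine-similar to query."""
    q = query / np.linalg.norm(query)
    E = lm_embeddings / np.linalg.norm(lm_embeddings, axis=1, keepdims=True)
    return np.argsort(-(E @ q))[:k]


# Synthetic stand-ins for the two pretrained embedding spaces.
rng = np.random.default_rng(0)
n_tokens, d_src, d_tgt = 200, 16, 24
src = rng.normal(size=(n_tokens, d_src))   # multi-modal embedding space
true_W = rng.normal(size=(d_src, d_tgt))   # hidden relation (illustration only)
tgt = src @ true_W                         # LM embedding space, noise-free

# Closed-form fit: no backpropagation through either model is needed.
W = fit_linear_map(src, tgt)

# A slightly perturbed source embedding plays the role of an image embedding;
# after mapping, it is compared against the LM token embeddings.
image_embedding = src[7] + 0.01 * rng.normal(size=d_src)
top = nearest_tokens(image_embedding @ W, tgt)
print(top)
```

In this toy setting, the tokens retrieved in the LM space could then be placed in a prompt for the generative LM. How the mapped semantics are actually passed to the LM is not specified in the abstract, so that step is left out of the sketch.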
