The objective of image captioning models is to bridge the gap between the visual and linguistic modalities by generating natural language descriptions that accurately reflect the content of input images. In recent years, researchers have leveraged deep learning-based models and made advances in the extraction of visual features and the design of multimodal connections to tackle this task. This work presents a novel approach to developing image captioning models that utilize an external kNN memory to improve the generation process. Specifically, we propose two model variants that incorporate a knowledge retriever component based on visual similarities, a differentiable encoder to represent input images, and a kNN-augmented language model to predict tokens based on contextual cues and text retrieved from the external memory. We experimentally validate our approach on the COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions, especially with a larger retrieval corpus. This work provides valuable insights into retrieval-augmented captioning models and opens up new avenues for improving image captioning at a larger scale.
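The retriever described above selects memory entries by visual similarity. The abstract does not specify its implementation, so the following is only a minimal sketch of kNN retrieval over precomputed image embeddings, with a hypothetical toy corpus and function names; a real system would use learned encoder features and an approximate-nearest-neighbour index.

```python
import numpy as np

def knn_retrieve_captions(query_emb, corpus_embs, corpus_captions, k=3):
    """Return the k captions whose image embeddings are most
    cosine-similar to the query image embedding (hypothetical sketch)."""
    # Normalize so that a dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    C = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = C @ q
    top = np.argsort(-sims)[:k]  # indices of the k nearest neighbours
    return [corpus_captions[i] for i in top], sims[top]

# Toy corpus: four images with 2-D "embeddings" and their captions.
corpus_embs = np.array([[1.0, 0.0],
                        [0.9, 0.1],
                        [0.0, 1.0],
                        [0.1, 0.9]])
corpus_captions = ["a dog on grass", "a dog playing",
                   "a red car", "a parked car"]

# A query embedding close to the two "dog" images.
captions, sims = knn_retrieve_captions(
    np.array([1.0, 0.05]), corpus_embs, corpus_captions, k=2)
print(captions)  # the two dog captions are the nearest neighbours
```

In a retrieval-augmented captioner, the retrieved captions would then be passed as additional context to the kNN-augmented language model alongside the encoded image.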