Inspired by retrieval-augmented language generation and pretrained Vision and Language (V&L) encoders, we present a new approach to image captioning that generates sentences given the input image and a set of captions retrieved from a datastore, as opposed to the image alone. The encoder in our model jointly processes the image and retrieved captions using a pretrained V&L BERT, while the decoder attends to the multimodal encoder representations, benefiting from the extra textual evidence from the retrieved captions. Experimental results on the COCO dataset show that image captioning can be effectively formulated from this new perspective. Our model, named EXTRA, benefits from using captions retrieved from the training dataset, and it can also benefit from using an external dataset without the need for retraining. Ablation studies show that retrieving a sufficient number of captions (e.g., k=5) can improve captioning quality. Our work contributes towards using pretrained V&L encoders for generative tasks, instead of standard classification tasks.
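The retrieval step described above can be illustrated with a minimal sketch. The abstract does not specify how captions are retrieved, so everything here is an assumption made for illustration: the function names `cosine` and `retrieve_captions`, the use of cosine similarity over precomputed embeddings, and the toy 3-dimensional vectors are all hypothetical and are not taken from EXTRA itself.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_captions(
    image_vec: Sequence[float],
    datastore: List[Tuple[str, Sequence[float]]],
    k: int = 5,
) -> List[str]:
    """Return up to k captions whose embeddings are most similar to image_vec.

    Assumption: image and caption embeddings live in a shared space and
    have been computed beforehand by some encoder (not shown here).
    """
    ranked = sorted(
        datastore,
        key=lambda item: cosine(image_vec, item[1]),
        reverse=True,
    )
    return [caption for caption, _ in ranked[:k]]


# Toy datastore with made-up embeddings (illustrative values only).
datastore = [
    ("a dog runs on the beach", [0.9, 0.1, 0.0]),
    ("a plate of pasta", [0.0, 0.2, 0.9]),
    ("a puppy plays in the sand", [0.8, 0.3, 0.1]),
]
image_vec = [1.0, 0.2, 0.0]

# The retrieved captions would be passed to the encoder together with the image.
retrieved = retrieve_captions(image_vec, datastore, k=2)
```

Because retrieval in this sketch is decoupled from any model weights, replacing `datastore` with a different caption collection changes the textual evidence without retraining, which is in the spirit of the external-dataset setting mentioned in the abstract.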