Multilingual image captioning has recently been tackled by training on large-scale machine-translated data, a process that is expensive, noisy, and time-consuming. We propose LMCap, an image-blind few-shot multilingual captioning model that requires no multilingual caption data and works by prompting a language model with retrieved captions. Specifically, instead of following the standard encoder-decoder paradigm, LMCap first retrieves the captions of images similar to the input image using a multilingual CLIP encoder. These captions are then combined into a prompt for an XGLM decoder, which generates a caption in the desired language. In other words, the generation model never processes the image directly; it processes only the retrieved captions. Experiments on the XM3600 dataset of geographically diverse images show that our model is competitive with fully supervised multilingual captioning models, without requiring supervised training on any captioning data.
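The retrieve-then-prompt flow described above can be illustrated with a minimal sketch. It makes several assumptions: the three-dimensional vectors are toy stand-ins for multilingual CLIP embeddings, the helper names (`cosine`, `retrieve_captions`, `build_prompt`) are hypothetical, and the prompt template is illustrative rather than the one used by LMCap. The resulting string is what would be passed to a language model such as XGLM; that generation step is omitted here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_captions(image_vec, datastore, k=2):
    """Return the captions of the k datastore entries most similar to image_vec."""
    ranked = sorted(
        datastore,
        key=lambda item: cosine(image_vec, item[0]),
        reverse=True,
    )
    return [caption for _, caption in ranked[:k]]


def build_prompt(captions, language):
    """Combine retrieved captions into a text prompt (hypothetical template)."""
    lines = ["Similar images have the following captions:"]
    lines += [f"- {caption}" for caption in captions]
    lines.append(f"A caption I can generate to describe this image in {language} is:")
    return "\n".join(lines)


# Toy datastore of (embedding, caption) pairs; real embeddings would come
# from a multilingual CLIP encoder applied to a captioned image collection.
datastore = [
    ([1.0, 0.0, 0.0], "a dog running on a beach"),
    ([0.9, 0.1, 0.0], "a puppy playing in the sand"),
    ([0.0, 1.0, 0.0], "a plate of pasta on a table"),
]

image_vec = [1.0, 0.0, 0.1]  # assumed embedding of the input image
captions = retrieve_captions(image_vec, datastore, k=2)
prompt = build_prompt(captions, "Spanish")
print(prompt)
```

Note that the image itself appears only as a query vector for retrieval. The prompt contains text alone, which is what makes the generation step image-blind.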