Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation

Training an image captioner without annotated image-sentence pairs has gained traction in recent years. Previous approaches fall into two strategies: crawling sentences from mismatching corpora and aligning them with the given images as pseudo annotations, or pre-training the captioner on external image-text pairs. However, the alignment strategy appears to have reached its performance limit because of the low quality of the resulting pairs, and pre-training requires significant computational resources. To address these challenges, we propose a new strategy, "LPM + retrieval-augmented learning", in which the prior knowledge of large pre-trained models (LPMs) is leveraged as supervision and a retrieval process is integrated to further reinforce its effectiveness. Specifically, we introduce Retrieval-augmented Pseudo Sentence Generation (RaPSG), which efficiently retrieves highly relevant short region descriptions from the mismatching corpora and uses them, via LPMs, to generate a variety of high-quality pseudo sentences with distinct representations. In addition, a fluency filter and a CLIP-guided training objective are introduced to facilitate model optimization. Experimental results demonstrate that our method surpasses the SOTA pre-training model (Flamingo3B), achieving a CIDEr score of 78.1 (+5.1) while utilizing only 0.3% of its trainable parameters (1.3B vs. 33M). Importantly, our approach eliminates the need for computationally expensive pre-training on external datasets (e.g., the 312M image-text pairs required by Flamingo3B). We further show that, with a simple extension, the generated pseudo sentences can be deployed as weak supervision to boost the 1% semi-supervised image captioning benchmark to a CIDEr score of 93.4 (+8.9), which showcases the versatility and effectiveness of our approach.
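
The retrieval step described above can be illustrated with a minimal sketch. The abstract does not specify how relevance is scored, so this example assumes that the image and each short region description have already been embedded into a shared vector space (for instance by a CLIP-like encoder) and ranks descriptions by cosine similarity. The function names `cosine` and `retrieve_top_k`, the toy embeddings, and the choice of `k` are all hypothetical and are not taken from RaPSG itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(image_embedding, corpus, k=2):
    """Return the k descriptions whose embeddings are most similar to the image.

    corpus is a list of (description, embedding) pairs. This is an assumed,
    simplified stand-in for the retrieval process, not the authors' method.
    """
    scored = [(cosine(image_embedding, emb), text) for text, emb in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy data: hand-written 3-d vectors standing in for real encoder outputs.
toy_corpus = [
    ("a brown dog", [0.9, 0.1, 0.0]),
    ("a red frisbee", [0.7, 0.6, 0.1]),
    ("a parked car", [0.0, 0.1, 0.9]),
]
toy_image = [1.0, 0.3, 0.0]

retrieved = retrieve_top_k(toy_image, toy_corpus, k=2)
print(retrieved)
```

In a pipeline of the kind the abstract outlines, the retrieved descriptions would then be handed to an LPM to produce pseudo sentences; that generation step is omitted here because the abstract gives no details about it.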
