Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation

Training an image captioner without annotated image-sentence pairs has gained traction in recent years. Previous approaches fall into two strategies: crawling sentences from mismatching corpora and aligning them with the given images as pseudo annotations, or pre-training the captioner on external image-text pairs. However, the alignment strategy appears to have reached its performance limit because of the low quality of the resulting pairs, and pre-training requires significant computational resources. To address these challenges, we propose a new strategy, "LPM + retrieval-augmented learning", in which the prior knowledge of large pre-trained models (LPMs) is leveraged as supervision and a retrieval process is integrated to further reinforce its effectiveness. Specifically, we introduce Retrieval-augmented Pseudo Sentence Generation (RaPSG), which efficiently retrieves highly relevant short region descriptions from the mismatching corpora and uses them, via LPMs, to generate a variety of high-quality pseudo sentences with distinct representations. In addition, a fluency filter and a CLIP-guided training objective are introduced to facilitate model optimization. Experimental results demonstrate that our method surpasses the SOTA pre-training model (Flamingo3B), achieving a CIDEr score of 78.1 (+5.1) while utilizing only 0.3% of its trainable parameters (1.3B vs. 33M). Importantly, our approach eliminates the need for computationally expensive pre-training on external datasets (e.g., the 312M image-text pairs required by Flamingo3B). We further show that, with a simple extension, the generated pseudo sentences can be deployed as weak supervision to boost the 1% semi-supervised image captioning benchmark to a CIDEr score of 93.4 (+8.9), which demonstrates the versatility and effectiveness of our approach.
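To make the retrieval step concrete, the following is a minimal sketch of ranking short region descriptions against an image and keeping the top k. It is an illustration under stated assumptions, not the RaPSG implementation: the abstract does not specify the similarity measure or the encoder, so cosine similarity over embeddings is assumed, and the function names (`cosine`, `retrieve_top_k`), the three-dimensional vectors, and the example descriptions are all hypothetical. In practice the embeddings would presumably come from a pre-trained vision-language encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(image_vec, corpus, k=2):
    """Return the k descriptions whose embeddings are most similar to image_vec.

    corpus is a list of (description, embedding) pairs. This helper is
    hypothetical and only illustrates similarity-based retrieval.
    """
    scored = [(cosine(image_vec, emb), text) for text, emb in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy, made-up embeddings standing in for encoder outputs (assumption).
corpus = [
    ("a brown dog", [0.9, 0.1, 0.0]),
    ("a red frisbee", [0.7, 0.6, 0.1]),
    ("a parked car", [0.0, 0.1, 0.9]),
]
image_vec = [1.0, 0.3, 0.0]

top = retrieve_top_k(image_vec, corpus, k=2)
print(top)
```

In a pipeline of the kind the abstract describes, the retrieved descriptions would then be passed to an LPM to compose candidate pseudo sentences; that generation step, the fluency filter, and the CLIP-guided objective are not sketched here.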
