Training an image captioner without annotated image-sentence pairs has gained traction in recent years. Previous approaches fall into two strategies: crawling sentences from mismatching corpora and aligning them with the given images as pseudo annotations, or pre-training the captioner on external image-text pairs. However, the alignment setting seems to have reached its performance limit because of quality problems in the resulting pairs, and pre-training requires significant computational resources. To address these challenges, we propose a new strategy, "LPM + retrieval-augmented learning", in which the prior knowledge of large pre-trained models (LPMs) is leveraged as supervision and a retrieval process is integrated to further reinforce its effectiveness. Specifically, we introduce Retrieval-augmented Pseudo Sentence Generation (RaPSG), which efficiently retrieves highly relevant short region descriptions from the mismatching corpora and uses them, via LPMs, to generate a variety of high-quality pseudo sentences with distinct representations. In addition, a fluency filter and a CLIP-guided training objective are introduced to facilitate model optimization. Experimental results demonstrate that our method surpasses the SOTA pre-training model (Flamingo3B), achieving a CIDEr score of 78.1 (+5.1) while using only 0.3% of its trainable parameters (1.3B vs. 33M). Importantly, our approach eliminates the need for computationally expensive pre-training on external datasets (e.g., the 312M image-text pairs required by Flamingo3B). We further show that, with a simple extension, the generated pseudo sentences can be deployed as weak supervision to boost the 1% semi-supervised image captioning benchmark to a CIDEr score of 93.4 (+8.9), which showcases the versatility and effectiveness of our approach.