Training an image captioner without annotated image-sentence pairs has gained traction in recent years. Previous approaches follow one of two strategies: crawling sentences from mismatching corpora and aligning them with the given images as pseudo annotations, or pre-training the captioner on external image-text pairs. However, the alignment strategy appears to have reached its performance limit because of quality problems in the aligned pairs, and pre-training requires significant computational resources. To address these challenges, we propose a new strategy, "LPM + retrieval-augmented learning", in which prior knowledge from large pre-trained models (LPMs) is leveraged as supervision and a retrieval process is integrated to further reinforce its effectiveness. Specifically, we introduce Retrieval-augmented Pseudo Sentence Generation (RaPSG), which efficiently retrieves highly relevant short region descriptions from the mismatching corpora and uses them, via LPMs, to generate a variety of high-quality pseudo sentences with distinct representations. In addition, a fluency filter and a CLIP-guided training objective are introduced to facilitate model optimization. Experimental results demonstrate that our method surpasses the state-of-the-art (SOTA) pre-training model (Flamingo3B), achieving a CIDEr score of 78.1 (+5.1) while utilizing only 0.3% of its trainable parameters (1.3B vs. 33M). Importantly, our approach eliminates the need for computationally expensive pre-training on external datasets (e.g., the 312M image-text pairs required by Flamingo3B). We further show that, with a simple extension, the generated pseudo sentences can be deployed as weak supervision to raise the 1% semi-supervised image captioning benchmark to a CIDEr score of 93.4 (+8.9), which showcases the versatility and effectiveness of our approach.
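The retrieval step described above can be illustrated with a minimal sketch. The abstract does not specify how RaPSG scores relevance, so this example assumes that the image and each region description have already been embedded into a shared vector space (for instance by a CLIP-like encoder) and ranks descriptions by cosine similarity. The function names `cosine` and `retrieve_top_k`, the toy vectors, and the value of `k` are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    image_vec: Sequence[float],
    corpus: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k region descriptions most similar to the image embedding.

    Assumption: similarity-based ranking stands in for the paper's
    (unspecified) retrieval procedure.
    """
    scored = [(text, cosine(image_vec, vec)) for text, vec in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy, hand-made embeddings (hypothetical; a real system would use an encoder).
corpus = [
    ("a brown dog", [0.9, 0.1, 0.0]),
    ("a red frisbee", [0.7, 0.6, 0.1]),
    ("a parked car", [0.0, 0.2, 0.9]),
]
image_vec = [1.0, 0.3, 0.0]

top = retrieve_top_k(image_vec, corpus, k=2)
for text, score in top:
    print(f"{score:.3f}  {text}")
```

In a pipeline of the kind the abstract outlines, the retrieved descriptions would then be passed to a large pre-trained model to compose pseudo sentences; that generation step, the fluency filter, and the CLIP-guided objective are not sketched here.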