Image captioning, a fundamental task in vision-language understanding, seeks to generate accurate natural language descriptions of given images. Current image captioning approaches rely heavily on high-quality image-caption pairs, which can be hard to obtain in many domains. To address this, we introduce a self-supervised image captioning method. After learning an initial signal from a small labeled dataset, our method transitions to self-supervised learning on unlabeled data, using the auxiliary task of increasing the CLIP relevance between images and their generated captions. Remarkably, despite using less than 2% of the labeled COCO dataset, our method delivers performance comparable to that of state-of-the-art models trained on the complete dataset. Human evaluations further reveal that our method produces captions that are more distinctive and informative, two attributes that are inherently difficult to achieve through supervised learning.
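The self-supervised stage described above can be illustrated with a small sketch. It is not the authors' implementation: the vectors below are hand-written stand-ins for the outputs of CLIP's image and text encoders, and the mean-baseline advantage is an assumed way of turning relevance scores into a training signal, since the abstract does not specify the exact objective. The names `cosine_similarity` and `relevance_advantages` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no relevance
    return dot / (norm_u * norm_v)


def relevance_advantages(image_embedding, caption_embeddings):
    """Score sampled captions against one image and centre the scores.

    Assumption: captions scoring above the mean relevance are reinforced
    and those below it are discouraged. This is an illustrative choice,
    not a detail taken from the abstract.
    """
    scores = [cosine_similarity(image_embedding, c) for c in caption_embeddings]
    baseline = sum(scores) / len(scores)
    advantages = [s - baseline for s in scores]
    return scores, advantages


# Toy embeddings standing in for CLIP encoder outputs (hypothetical values).
image_embedding = [1.0, 0.0, 0.0]
caption_embeddings = [
    [1.0, 0.0, 0.0],  # caption aligned with the image
    [0.0, 1.0, 0.0],  # caption unrelated to the image
    [1.0, 1.0, 0.0],  # caption partially related to the image
]

scores, advantages = relevance_advantages(image_embedding, caption_embeddings)
best_index = max(range(len(scores)), key=scores.__getitem__)
```

In a real pipeline the embeddings would come from a pretrained CLIP model applied to unlabeled images and to captions sampled from the captioning model, and the advantages would weight the log-likelihood of each sampled caption during the update.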