Vision-language (VL) pre-training (VLP) has been shown to generalize VL models well across a wide range of VL downstream tasks, especially cross-modal retrieval. However, it hinges on a huge number of image-text pairs, which require tedious and costly curation. In contrast, weakly-supervised VLP (W-VLP) replaces paired captions with object tags generated from images by a pre-trained object detector (OD). Yet such approaches still require paired information, i.e. images and object-level annotations, as supervision to train the OD. To further reduce the amount of supervision, we propose Prompts-in-The-Loop (PiTL), which prompts large language models (LLMs) for knowledge that describes images. Concretely, given the category label of an image, e.g. refinery, the knowledge extracted by LLMs, e.g. "a refinery could be seen with large storage tanks, pipework, and ...", is used as the language counterpart of the image. This knowledge supplies, for example, the common relations among the entities most likely to appear in a scene. Using PiTL, we create IN14K, a new VL dataset of 9M images and 1M descriptions covering 14K categories from ImageNet21K. Empirically, VL models pre-trained with PiTL-generated pairs outperform other W-VLP works on image-to-text (I2T) and text-to-image (T2I) retrieval tasks while using less supervision. The results demonstrate the effectiveness of PiTL-generated pairs for VLP.
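The pairing step described above (category label in, LLM-generated description out, description used as the text side of an image-text pair) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the prompt template in `build_prompt`, the `describe` callable standing in for an LLM, the number of descriptions per category, and the round-robin rule for assigning descriptions to images. Descriptions are generated once per category and reused, which matches the setting of far fewer descriptions than images.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def build_prompt(category: str) -> str:
    # Hypothetical template; the actual PiTL prompt wording is not given here.
    return f"Describe what can typically be seen in an image of a {category}."


def generate_pairs(
    labeled_images: Iterable[Tuple[str, str]],
    describe: Callable[[str], str],
    descriptions_per_category: int = 2,
) -> List[Tuple[str, str]]:
    """Pair each (image_id, category) with an LLM-generated description.

    `describe` is any function mapping a prompt to text (an assumption that
    stands in for a real LLM call). Descriptions are generated once per
    category and assigned to that category's images in round-robin order.
    """
    if descriptions_per_category < 1:
        raise ValueError("descriptions_per_category must be at least 1")

    cache: Dict[str, List[str]] = {}   # category -> generated descriptions
    seen: Dict[str, int] = {}          # category -> images paired so far
    pairs: List[Tuple[str, str]] = []

    for image_id, category in labeled_images:
        if category not in cache:
            prompt = build_prompt(category)
            cache[category] = [
                describe(prompt) for _ in range(descriptions_per_category)
            ]
            seen[category] = 0
        texts = cache[category]
        pairs.append((image_id, texts[seen[category] % len(texts)]))
        seen[category] += 1

    return pairs


def stub_llm(prompt: str) -> str:
    # Placeholder for a real LLM; it only echoes the prompt.
    return f"(generated text for prompt: {prompt})"


if __name__ == "__main__":
    data = [
        ("img_001.jpg", "refinery"),
        ("img_002.jpg", "refinery"),
        ("img_003.jpg", "lighthouse"),
    ]
    for image_id, text in generate_pairs(data, stub_llm):
        print(image_id, "->", text)
```

In this sketch the only supervision consumed is the image-level category label; no captions or object-level annotations are involved, which is the reduction in supervision the abstract refers to.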