Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching the textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this gap as a community effort, leveraging the powerful and \textit{open-sourced} LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: we first fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that the resulting dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images align significantly better with users' text instructions, especially for complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/
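The second stage of the pipeline, applying a fine-tuned captioner to every image in a large collection, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Pair` and `RecaptionedPair` records, the `recaption` function, and the `captioner` callable (a stand-in for a fine-tuned LLaVA-style model) are all hypothetical names introduced here for illustration. Keeping the original caption next to the new one is likewise a design choice of this sketch, not something the abstract specifies.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class Pair:
    image_id: str   # hypothetical identifier for an image
    caption: str    # original web-crawled caption (possibly noisy)


@dataclass
class RecaptionedPair:
    image_id: str
    original_caption: str
    recaption: str  # caption produced by the captioning model


def batched(items: Iterable[Pair], size: int) -> Iterator[List[Pair]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[Pair] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def recaption(
    pairs: Iterable[Pair],
    captioner: Callable[[List[str]], List[str]],
    batch_size: int = 4,
) -> Iterator[RecaptionedPair]:
    """Stream pairs through `captioner` in batches.

    `captioner` is an assumed interface: it takes a list of image ids
    and returns one caption per id, in the same order.
    """
    for batch in batched(pairs, batch_size):
        new_captions = captioner([p.image_id for p in batch])
        if len(new_captions) != len(batch):
            raise ValueError("captioner must return one caption per image")
        for pair, text in zip(batch, new_captions):
            yield RecaptionedPair(pair.image_id, pair.caption, text)
```

Because `recaption` is a generator, it never holds more than one batch in memory, which is the property that matters when the input has on the order of a billion entries. In practice the `captioner` would wrap model inference, and the work would be sharded across many workers.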