This paper examines two problems in widely used large-scale Vision-Language Pre-Training (VLP) datasets: severe image-text misalignment and high redundancy. To address them, we propose an efficient and straightforward Vision-Language learning algorithm called TL;DR, which compresses existing large VLP data into a small, high-quality set. Our approach consists of two major steps. First, a codebook-based encoder-decoder captioner is developed to select representative samples. Second, a new caption is generated to complement the original caption of each selected sample, mitigating the text-image misalignment problem while maintaining uniqueness. As a result, TL;DR reduces a large dataset to a small set of high-quality data that can serve as an alternative pre-training dataset, which significantly speeds up the time-consuming pre-training process. Specifically, TL;DR can compress mainstream VLP datasets at a high ratio, e.g., reducing the well-cleaned CC3M dataset from 2.82M to 0.67M ($\sim$24\%) and the noisy YFCC15M dataset from 15M to 2.5M ($\sim$16.7\%). Extensive experiments with three popular VLP models over seven downstream tasks show that a VLP model trained on the compressed dataset provided by TL;DR achieves similar or even better results than one trained on the full-scale dataset. The code will be made available at \url{https://github.com/showlab/data-centric.vlp}.
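The compression ratios quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the dataset sizes are the ones stated in the abstract, while the helper name `retained_percent` and the "fewer samples" factor are assumptions introduced here, not part of TL;DR. The factor counts image-text pairs only and should not be read as a measured training speedup.

```python
def retained_percent(kept_millions, original_millions):
    """Percentage of the original dataset kept after compression (hypothetical helper)."""
    return 100.0 * kept_millions / original_millions


# (original size, compressed size) in millions of image-text pairs, as stated in the abstract.
datasets = {"CC3M": (2.82, 0.67), "YFCC15M": (15.0, 2.5)}

for name, (original, kept) in datasets.items():
    pct = retained_percent(kept, original)
    # original / kept is the reduction in sample count, not a wall-clock speedup.
    print(f"{name}: keeps {pct:.1f}% of the data ({original / kept:.1f}x fewer samples)")
```

Under these numbers, CC3M retains about 23.8% of its pairs and YFCC15M about 16.7%, consistent with the rounded figures in the abstract.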