Self-supervised Multi-modal Contrastive Learning (SMCL) has significantly advanced modern Vision-Language Pre-training (VLP) models by aligning the visual and linguistic modalities. However, because web-harvested text-image pairs are noisy, scaling up the training data volume in SMCL raises considerable obstacles in terms of computational cost and data inefficiency. To improve data efficiency in VLP, we propose Text-aware Image Mixing (TiMix), which integrates mix-based data augmentation techniques into SMCL, yielding significant performance improvements without significantly increasing computational overhead. We provide a theoretical analysis of TiMix from a mutual information (MI) perspective, showing that mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss. The experimental results demonstrate that TiMix achieves comparable performance on downstream tasks, even with a reduced amount of training data and shorter training time, when benchmarked against existing methods. This work empirically and theoretically demonstrates the potential of data mixing for data-efficient and computationally viable VLP, benefiting broader adoption of VLP models in practical scenarios.