Over the past two years, vision-language pre-training has achieved noteworthy success on several downstream tasks. Nevertheless, acquiring high-quality image-text pairs in which the pairs are entirely mutually exclusive remains challenging, and noise exists in the commonly used datasets. To address this issue, we propose SoftCLIP, a novel approach that relaxes the strict one-to-one constraint and achieves a soft cross-modal alignment by introducing a softened target generated from fine-grained intra-modal self-similarity. The intra-modal guidance indicates that two pairs may share some local similarities, which allows the model to capture many-to-many relationships between the two modalities. In addition, since the positive still dominates the softened target distribution, we disentangle the negatives in the distribution to further strengthen relation alignment with the negatives in cross-modal learning. Extensive experiments demonstrate the effectiveness of SoftCLIP. In particular, on the ImageNet zero-shot classification task, using CC3M/CC12M as the pre-training dataset, SoftCLIP brings a top-1 accuracy improvement of 6.8%/7.2% over the CLIP baseline.
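The softened target and the negative disentangling described above can be illustrated with a minimal, dependency-free sketch for a single sample. The abstract does not give the exact formulation, so everything concrete here is an assumption made for illustration: the linear blend of the one-hot label with a softmax over intra-modal similarities, the `mix` and `temperature` values, the use of KL divergence, and all function names. It should not be read as the authors' implementation.

```python
import math


def softmax(row, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softened_target(intra_sim_row, pos_index, mix=0.5, temperature=0.1):
    """Blend the one-hot label with a softmax over intra-modal similarities.

    Assumption: a linear blend; `mix` and `temperature` are illustrative
    hyperparameters, not values taken from the paper.
    """
    soft = softmax(intra_sim_row, temperature)
    return [
        (1.0 - mix) * (1.0 if j == pos_index else 0.0) + mix * p
        for j, p in enumerate(soft)
    ]


def drop_positive(dist, pos_index):
    """Remove the positive entry and renormalize over the negatives only."""
    negatives = [p for j, p in enumerate(dist) if j != pos_index]
    total = sum(negatives)
    if total == 0.0:
        raise ValueError("no probability mass on the negatives")
    return [p / total for p in negatives]


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for two discrete distributions of equal length."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, pred)
        if t > 0.0
    )


def soft_alignment_loss(cross_sim_row, intra_sim_row, pos_index):
    """Soft alignment term plus a negatives-only relation term (assumed form)."""
    target = softened_target(intra_sim_row, pos_index)
    pred = softmax(cross_sim_row, temperature=0.1)
    full_term = kl_divergence(target, pred)
    neg_term = kl_divergence(
        drop_positive(target, pos_index), drop_positive(pred, pos_index)
    )
    return full_term + neg_term


# Hypothetical similarities of sample 0 to a batch of three samples.
intra = [1.0, 0.8, 0.1]  # within one modality (e.g. image-to-image)
cross = [0.9, 0.7, 0.0]  # across modalities (e.g. image-to-text)
target = softened_target(intra, pos_index=0)
loss = soft_alignment_loss(cross, intra, pos_index=0)
```

In this sketch the positive entry keeps most of the probability mass, while the remaining mass is spread over the negatives in proportion to their intra-modal similarity. Removing the positive and renormalizing is one simple way to keep the relations among negatives from being drowned out by the dominant positive.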