Self-supervised contrastive learning has proven effective in deriving high-quality representations from unlabeled data. However, a major challenge that hinders both unimodal and multimodal contrastive learning is feature suppression, a phenomenon where the trained model captures only a limited portion of the information in the input data while overlooking other potentially valuable content. This issue often leads to indistinguishable representations for visually similar but semantically different inputs, adversely affecting downstream performance, particularly on tasks requiring rigorous semantic comprehension. To address this challenge, we propose a novel model-agnostic Multistage Contrastive Learning (MCL) framework. Unlike standard contrastive learning, which inherently captures a single biased feature distribution, MCL progressively learns previously unlearned features through feature-aware negative sampling at each stage, where the negative samples of an anchor are selected exclusively from the cluster it was assigned to in preceding stages. Meanwhile, MCL preserves previously well-learned features through cross-stage representation integration, combining features across all stages to form the final representations. Our comprehensive evaluation demonstrates MCL's effectiveness and superiority across both unimodal and multimodal contrastive learning, spanning a range of model architectures from ResNet to Vision Transformers (ViT). Remarkably, in tasks where the original CLIP model has shown limitations, MCL dramatically enhances performance, with improvements of up to threefold on specific attributes in the recently proposed MMVP benchmark.
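The two mechanisms described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: all function names are hypothetical, the clustering step that produces `cluster_ids` (e.g. k-means over the previous stage's embeddings) is assumed to have already run, and a plain InfoNCE-style loss stands in for whatever contrastive objective the framework actually uses.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss over an explicit negative set,
    using cosine similarity (illustrative stand-in for the MCL objective)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    pos = np.exp(cos(anchor, positive) / temperature)
    neg = sum(np.exp(cos(anchor, n) / temperature) for n in negatives)
    return -np.log(pos / (pos + neg))

def feature_aware_negatives(i, embeddings, cluster_ids):
    """Feature-aware negative sampling: negatives for anchor i are drawn
    only from the cluster it was assigned to in the preceding stage, so
    the new stage must find features that separate these hard negatives."""
    return [embeddings[j] for j in range(len(embeddings))
            if j != i and cluster_ids[j] == cluster_ids[i]]

def integrate_stages(stage_reps):
    """Cross-stage representation integration: concatenate the per-stage
    features so earlier, well-learned features are preserved in the
    final representation (concatenation is one plausible choice)."""
    return np.concatenate(stage_reps, axis=-1)
```

Restricting negatives to the anchor's own previous-stage cluster is what forces each new stage beyond the already-captured (suppressing) features: samples in one cluster are, by construction, indistinguishable under the features learned so far.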