The success of contrastive learning depends on constructing and exploiting high-quality positive pairs. Current methods, however, face critical limitations on two fronts. On the construction side, both handcrafted and generative augmentations often suffer from limited diversity and risk corrupting semantics. On the learning side, the absence of a quality assessment mechanism leads to suboptimal supervision, because all pairs are treated equally. To tackle these challenges, we propose GenView++, a unified framework that addresses both fronts with two synergistic innovations. First, to improve pair construction, a multi-source adaptive view generation mechanism synthesizes diverse yet semantically coherent views by dynamically modulating generative parameters across image-conditioned, text-conditioned, and image-text-conditioned strategies. Second, a quality-driven contrastive learning mechanism assesses each pair's semantic alignment and diversity and dynamically reweights its training contribution, prioritizing high-quality pairs while suppressing redundant or misaligned ones. Extensive experiments demonstrate the effectiveness of GenView++ on both vision and vision-language tasks. For vision representation learning, it improves MoCov2 by +2.5% on ImageNet linear classification. For vision-language learning, it raises the average zero-shot classification accuracy by +12.31% over CLIP and +5.31% over SLIP across ten datasets, and further improves Flickr30k text retrieval R@5 by +3.2%.
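To make the idea of quality-driven reweighting concrete, the following is a minimal sketch of a contrastive loss in which each positive pair's contribution is scaled by a quality weight. It is not the formulation used by GenView++, which the abstract does not specify. The function names (`pair_weights`, `weighted_info_nce`), the choice to sum the alignment and diversity scores, the softmax normalization, and both temperature parameters are assumptions made purely for illustration. The alignment and diversity scores are assumed to be supplied by some external scorer.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def pair_weights(alignment, diversity, temperature=1.0):
    """Hypothetical weighting: softmax over (alignment + diversity),
    rescaled so that the weights average to 1 across the batch."""
    scores = [(a + d) / temperature for a, d in zip(alignment, diversity)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    n = len(scores)
    return [n * e / total for e in exps]


def weighted_info_nce(anchors, positives, weights, tau=0.2):
    """InfoNCE over a batch, with each pair's loss term scaled by its weight.
    Row i of `positives` is the positive for row i of `anchors`;
    all other rows serve as negatives."""
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    n = len(anchors)
    loss = 0.0
    for i in range(n):
        logits = [dot(anchors[i], positives[j]) / tau for j in range(n)]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += weights[i] * (log_denom - logits[i])
    return loss / n


# Toy batch: the first pair is scored as higher quality than the second.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.2, 0.8]]
weights = pair_weights(alignment=[0.9, 0.4], diversity=[0.5, 0.1])
loss = weighted_info_nce(anchors, positives, weights)
```

Rescaling the weights to average 1 keeps the overall loss magnitude comparable to the unweighted case: with equal quality scores every weight is 1 and the function reduces to standard InfoNCE.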