Data augmentation is a crucial component in unsupervised contrastive learning (CL). It determines how positive samples are defined and, ultimately, the quality of the learned representation. In this work, we open the door to new perspectives for CL by integrating prior knowledge, given either by generative models -- viewed as prior representations -- or by weak attributes, into the positive and negative sampling. To this end, we use kernel theory to propose a novel loss, called decoupled uniformity, that i) allows the integration of prior knowledge and ii) removes the negative-positive coupling in the original InfoNCE loss. We draw a connection between contrastive learning and conditional mean embedding theory to derive tight bounds on the downstream classification loss. In an unsupervised setting, we empirically demonstrate that CL benefits from generative models, which improve its representation on both natural and medical images. In a weakly supervised scenario, our framework outperforms other unconditional and conditional CL approaches.