Contrastive learning (CL) has emerged as a powerful technique for representation learning, with or without label supervision. However, supervised CL is prone to collapsing the representations of subclasses within a class by not capturing all their features, and unsupervised CL may suppress harder class-relevant features by focusing on learning easy class-irrelevant features; both significantly compromise representation quality. Yet, there is no theoretical understanding of \textit{class collapse} or \textit{feature suppression} at \textit{test} time. We provide the first unified, theoretically rigorous framework to determine \textit{which} features are learnt by CL. Our analysis indicates that, perhaps surprisingly, the bias of (stochastic) gradient descent towards finding simpler solutions is a key factor in collapsing subclass representations and suppressing harder class-relevant features. Moreover, we present increasing embedding dimensionality and improving the quality of data augmentations as two theoretically motivated solutions to \textit{feature suppression}. We also provide the first theoretical explanation for why employing supervised and unsupervised CL together yields higher-quality representations, even when using commonly used stochastic gradient methods.