SimCLR is one of the most popular contrastive learning methods for vision tasks. It pre-trains deep neural networks on large amounts of unlabeled data by teaching the model to distinguish between positive and negative pairs of augmented images. It is widely believed that SimCLR pre-training enables a deep neural network to learn efficient representations that improve the performance of subsequent supervised fine-tuning. Despite its empirical effectiveness, our theoretical understanding of the mechanisms underlying SimCLR remains limited. In this paper, we present a theoretical case study of the SimCLR method. Specifically, we consider training a two-layer convolutional neural network (CNN) to learn a toy image data model. We show that, under certain conditions on the number of labeled examples, SimCLR pre-training combined with supervised fine-tuning achieves almost optimal test loss. Notably, the label complexity of SimCLR pre-training is far less demanding than that of direct supervised training. Our analysis sheds light on the benefits of SimCLR in learning with fewer labels.
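For concreteness, the contrastive objective SimCLR uses to distinguish positive from negative pairs is the NT-Xent (normalized temperature-scaled cross-entropy) loss. The NumPy sketch below is illustrative rather than a reproduction of the paper's exact setup; the batch size, embedding dimension, and temperature value are assumptions chosen for demonstration.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR-style NT-Xent loss for a batch of paired embeddings.

    z1, z2: (N, d) arrays of embeddings from two augmentations of the
    same N images. Row i of z1 and row i of z2 form a positive pair;
    every other row in the batch serves as a negative.
    """
    z = np.concatenate([z1, z2], axis=0)               # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit norm -> dot = cosine sim
    sim = z @ z.T / temperature                        # (2N, 2N) scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    n = z1.shape[0]
    # index of each row's positive partner in the concatenated batch
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # cross-entropy of each row's positive against all non-self pairs
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

# Illustrative usage: perfectly aligned positives yield a lower loss
# than misaligned (shuffled) positives.
rng = np.random.default_rng(0)
z1 = rng.standard_normal((4, 8))
loss_aligned = nt_xent_loss(z1, z1.copy())
loss_shuffled = nt_xent_loss(z1, z1[::-1].copy())
```

Minimizing this loss pulls the two augmented views of each image together in embedding space while pushing apart views of different images, which is the mechanism the abstract refers to as distinguishing positive and negative pairs.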