Supervised contrastive loss (SCL) is a competitive and often superior alternative to the cross-entropy (CE) loss for classification. In this paper we ask: how does the learning process differ when these two loss functions are optimized? Our main finding is that the embeddings learned by SCL form an orthogonal frame (OF) regardless of the number of training examples per class. This contrasts with the CE loss, which previous work has shown learns embedding geometries that depend strongly on the class sizes. We arrive at this finding theoretically, by proving that the global minimizers of an unconstrained features model with SCL and entry-wise non-negativity constraints form an OF. We then validate the model's prediction through experiments with standard deep-learning models on benchmark vision datasets. Finally, our analysis and experiments reveal that the batching scheme chosen during SCL training plays a critical role in determining the quality of convergence to the OF geometry. This finding motivates a simple algorithm in which adding a few binding examples to each batch significantly speeds up the emergence of the OF geometry.
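To make the OF geometry concrete, the following is a minimal sketch of one way to check whether a set of embeddings is close to an orthogonal frame. The function name `of_distance`, the use of normalized class means, and the Frobenius-normalized distance to the identity are illustrative assumptions, not necessarily the metric used in the paper. The toy data is also hypothetical: an imbalanced three-class set in which every class collapses onto its own non-negative axis.

```python
import numpy as np

def of_distance(embeddings, labels):
    """Distance between the class-mean Gram matrix and the identity.

    Returns 0 when the normalized class means are mutually orthogonal
    (an orthogonal frame). The metric is an illustrative assumption.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # One mean embedding per class, regardless of how many examples it has.
    means = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    # Normalize each class mean to unit length so only the angles matter.
    means = means / np.linalg.norm(means, axis=1, keepdims=True)
    gram = means @ means.T
    k = len(classes)
    # Compare the Frobenius-normalized Gram matrix with the normalized identity.
    return float(np.linalg.norm(gram / np.linalg.norm(gram) - np.eye(k) / np.sqrt(k)))

# Hypothetical imbalanced toy data: 3 classes with 5, 2, and 1 examples.
labels = np.array([0] * 5 + [1] * 2 + [2] * 1)
of_embeddings = np.eye(3)[labels]   # each class sits on its own axis
collapsed = np.ones((8, 3))         # all classes share a single direction

print(of_distance(of_embeddings, labels))  # zero: a perfect orthogonal frame
print(of_distance(collapsed, labels))      # clearly positive: not an OF
```

Because the check uses only class means, it is insensitive to class sizes by construction, which mirrors the claim that the OF geometry does not depend on the number of examples per class.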