Minimizing cross-entropy over the softmax scores of a linear map composed with a high-capacity encoder is arguably the most popular choice for training neural networks on supervised learning tasks. However, recent work shows that one can instead optimize the encoder directly, obtaining equally (or even more) discriminative representations via a supervised variant of a contrastive objective. In this work, we address the question of whether there are fundamental differences in the sought-for representation geometry in the output space of the encoder at minimal loss. Specifically, we prove, under mild assumptions, that both losses attain their minimum once the representations of each class collapse to the vertices of a regular simplex inscribed in a hypersphere. We provide empirical evidence that this configuration is attained in practice and that reaching a close-to-optimal state typically indicates good generalization performance. Yet, the two losses show remarkably different optimization behavior. For the supervised contrastive loss, the number of iterations required to fit the data perfectly scales superlinearly with the amount of randomly flipped labels. This is in contrast to the approximately linear scaling previously reported for networks trained with cross-entropy.
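To make the target geometry concrete, the following is a minimal sketch of a regular simplex inscribed in the unit hypersphere, assuming K classes and one vertex per class. The function name `simplex_vertices` and the choice K = 4 are illustrative and do not come from the paper. The construction centers the standard basis vectors and rescales them to unit norm, so the vertices are written in K coordinates although they span only a (K-1)-dimensional subspace.

```python
import numpy as np


def simplex_vertices(num_classes: int) -> np.ndarray:
    """Return `num_classes` unit vectors (as rows) forming a regular simplex.

    Illustrative helper, not taken from the paper.
    """
    # Row i is e_i minus the centroid of the standard basis vectors.
    centered = np.eye(num_classes) - 1.0 / num_classes
    # Rescale every row to unit length so the vertices lie on the unit sphere.
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)


K = 4  # hypothetical number of classes
V = simplex_vertices(K)

# Gram matrix of pairwise inner products between class vertices.
gram = V @ V.T
off_diag = gram[~np.eye(K, dtype=bool)]

print(np.round(gram, 3))
```

In this configuration every vertex has unit norm, the vertices sum to zero, and every pair of distinct vertices has inner product -1/(K-1), meaning all classes are equally and maximally separated on the sphere.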