Logit regularization, the addition of a convex penalty directly in logit space, is widely used in modern classifiers, with label smoothing as a prominent example. While such methods often improve calibration and generalization, the mechanism behind these gains remains under-explored. In this work, we analyze a general class of logit regularizers in the context of linear classification and show that they induce an implicit bias toward logit clustering around finite per-sample targets. For Gaussian data, or whenever logits are sufficiently clustered, we prove that logit clustering drives the weight vector to align exactly with Fisher's Linear Discriminant. To illustrate the consequences, we study a simple signal-plus-noise model in which this transition has dramatic effects: logit regularization halves the critical sample complexity and induces grokking in the small-noise limit, while making generalization robust to noise. Our results extend the theoretical understanding of label smoothing and highlight the efficacy of a broader class of logit-regularization methods.
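As a concrete illustration of the "finite per-sample targets" above (a standard textbook computation, not a derivation taken from this work; the smoothing rate $\epsilon$ is assumed notation): for a positive example with logit $z$, binary label smoothing replaces the hard target $1$ with $1-\epsilon$, giving the loss
\[
\ell_\epsilon(z) \;=\; -(1-\epsilon)\log \sigma(z) \;-\; \epsilon \log\bigl(1-\sigma(z)\bigr),
\qquad \sigma(z)=\frac{1}{1+e^{-z}},
\]
whose stationarity condition $\sigma(z^\star)=1-\epsilon$ yields the finite logit target
\[
z^\star \;=\; \log\frac{1-\epsilon}{\epsilon},
\]
in contrast to the unregularized case $\epsilon=0$, where the optimal logit diverges. The Fisher's Linear Discriminant direction referred to above is the classical
\[
w_{\mathrm{FLD}} \;\propto\; \Sigma_w^{-1}\bigl(\mu_{+}-\mu_{-}\bigr),
\]
with class means $\mu_{\pm}$ and within-class covariance $\Sigma_w$.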