Feature learning strength (FLS), i.e., the inverse of the effective output scaling of a model, plays a critical role in shaping the optimization dynamics of neural networks. While its impact has been extensively studied in asymptotic regimes -- in both training time and FLS -- existing theory offers limited insight into how FLS affects generalization in practical settings, such as when training is stopped upon reaching a target training risk. In this work, we investigate the impact of FLS on generalization in deep networks under such practical conditions. Through empirical studies, we first uncover the emergence of an $\textit{optimal FLS}$ -- neither too small nor too large -- that yields substantial generalization gains. This finding runs counter to the prevailing intuition that stronger feature learning universally improves generalization. To explain this phenomenon, we develop a theoretical analysis of gradient flow dynamics in two-layer ReLU networks trained with the logistic loss, where FLS is controlled via the initialization scale. Our main theoretical result establishes the existence of an optimal FLS arising from a trade-off between two competing effects: an excessively large FLS induces an $\textit{over-alignment}$ phenomenon that degrades generalization, while an overly small FLS leads to $\textit{over-fitting}$.
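To make the experimental setting concrete, here is a minimal sketch (not the paper's code) of the setup the abstract describes: a two-layer ReLU network trained by full-batch gradient descent (a discretization of gradient flow) on the logistic loss, with an initialization scale `alpha` serving as the knob for FLS, and training stopped once a target training risk is reached. The width, learning rate, stopping threshold, and the exact mapping from `alpha` to FLS are illustrative assumptions, not values from the paper.

```python
import numpy as np

def init_params(d, m, alpha, rng):
    """Two-layer ReLU net f(x) = a^T relu(W x).
    Per the abstract, the initialization scale alpha controls the FLS;
    the precise alpha-to-FLS mapping here is an assumption."""
    W = alpha * rng.standard_normal((m, d)) / np.sqrt(d)
    a = alpha * rng.standard_normal(m) / np.sqrt(m)
    return W, a

def logistic_loss(f, y):
    # y in {-1, +1}; mean logistic loss log(1 + exp(-y f)), computed stably
    return np.mean(np.logaddexp(0.0, -y * f))

def train_until_target(X, y, alpha, lr=0.1, target_risk=0.05,
                       width=256, max_steps=20000, seed=0):
    """Full-batch gradient descent, stopped when the training risk hits
    target_risk -- the 'practical' stopping rule studied in the paper."""
    rng = np.random.default_rng(seed)
    W, a = init_params(X.shape[1], width, alpha, rng)
    for _ in range(max_steps):
        h = np.maximum(X @ W.T, 0.0)              # hidden activations
        f = h @ a                                  # network outputs
        if logistic_loss(f, y) <= target_risk:
            break                                  # stop at target risk
        g = -y / (1.0 + np.exp(y * f)) / len(y)   # dL/df per example
        a -= lr * (h.T @ g)                        # output-layer gradient
        W -= lr * ((g[:, None] * (h > 0)) * a).T @ X  # hidden-layer gradient
    return W, a
```

Under this sketch, sweeping `alpha` over several orders of magnitude and comparing test error at the stopping time is one way to probe for the optimal-FLS behavior the abstract reports.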