Neural networks can be trained to solve regression problems by using gradient-based methods to minimize the square loss. However, practitioners often prefer to reformulate regression as a classification problem, observing that training on the cross-entropy loss results in better performance. By focusing on two-layer ReLU networks, which can be fully characterized by measures over their feature space, we explore how the implicit bias induced by gradient-based optimization could partly explain this phenomenon. We provide theoretical evidence that, for one-dimensional data, the regression formulation yields a measure whose support can differ greatly from the one obtained for classification. Our proposed optimal supports correspond directly to the features learned by the input layer of the network. The different nature of these supports sheds light on optimization difficulties that the square loss may encounter during training, and we present empirical results illustrating these difficulties.
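As an illustration of what "reformulating regression as classification" can mean in practice, the following is a minimal sketch and not the construction analyzed in the paper. It assumes evenly spaced bins over the target range and a softmax-mean decoding step; the helper names `make_bins`, `to_class_labels`, and `decode` are hypothetical.

```python
import numpy as np

def make_bins(y, num_bins):
    """Return evenly spaced bin edges and centers covering the range of y.

    Assumption: uniform binning; other discretizations are possible.
    """
    edges = np.linspace(y.min(), y.max(), num_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return edges, centers

def to_class_labels(y, edges):
    """Map each real-valued target to the index of the bin containing it."""
    # Only interior edges are used, so labels lie in {0, ..., num_bins - 1}.
    return np.digitize(y, edges[1:-1])

def decode(logits, centers):
    """Turn class logits into a real-valued prediction (softmax-weighted mean)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize the softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs @ centers

# Usage: discretize targets, train any classifier with cross-entropy on
# `labels`, then map its logits back to real values with `decode`.
y = np.array([0.0, 0.3, 0.6, 1.0])
edges, centers = make_bins(y, num_bins=4)
labels = to_class_labels(y, edges)
```

A network trained with the cross-entropy loss on `labels` then outputs one logit per bin, and `decode` recovers a scalar prediction that can be compared with the output of a network trained directly on the square loss.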