In this work, we provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent on the logistic loss following random initialization. We consider data with binary labels that are generated by an XOR-like function of the input features. We permit a constant fraction of the training labels to be corrupted by an adversary. We show that, although linear classifiers are no better than random guessing for the distribution we consider, two-layer ReLU networks trained by gradient descent achieve generalization error close to the label noise rate. We develop a novel proof technique showing that, at initialization, the vast majority of neurons function as random features that are only weakly correlated with useful features, and that the gradient descent dynamics 'amplify' these weak, random features into strong, useful features.
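The setting described above can be illustrated with a small numerical sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the cluster means `mu1` and `mu2`, the signal strength, the choice to train only the first layer with fixed second-layer signs, the use of random label flips in place of an adversary, the no-bias least-squares baseline, and all hyperparameters. The helper names (`cluster_means`, `make_xor_data`, `train_two_layer_relu`, `mean_alignment`) are hypothetical.

```python
import numpy as np


def cluster_means(d, signal=4.0):
    """Two orthogonal means: +/-mu1 carries label +1, +/-mu2 carries label -1."""
    mu1 = np.zeros(d)
    mu2 = np.zeros(d)
    mu1[0] = signal
    mu2[1] = signal
    return mu1, mu2


def make_xor_data(n, d, noise_rate, rng, signal=4.0):
    """Sample XOR-like Gaussian clusters and flip a fraction of the labels."""
    mu1, mu2 = cluster_means(d, signal)
    centers = np.stack([mu1, -mu1, mu2, -mu2])
    cluster = rng.integers(0, 4, size=n)
    x = centers[cluster] + rng.standard_normal((n, d))
    y_clean = np.where(cluster < 2, 1.0, -1.0)
    # Assumption: random flips stand in for the adversarial corruption.
    flip = rng.random(n) < noise_rate
    y = np.where(flip, -y_clean, y_clean)
    return x, y, y_clean


def forward(W, a, x):
    """Two-layer ReLU network f(x) = sum_j a_j * relu(w_j . x)."""
    return np.maximum(x @ W.T, 0.0) @ a


def train_two_layer_relu(x, y, rng, m=256, lr=0.1, steps=300, init_scale=0.01):
    """Full-batch gradient descent on the logistic loss, first layer only."""
    n, d = x.shape
    W_init = init_scale * rng.standard_normal((m, d))
    a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # fixed second layer
    W = W_init.copy()
    for _ in range(steps):
        pre = x @ W.T                              # pre-activations, shape (n, m)
        out = np.maximum(pre, 0.0) @ a             # network outputs, shape (n,)
        margin = np.clip(y * out, -50.0, 50.0)     # clip to avoid overflow in exp
        g = -y / (1.0 + np.exp(margin))            # d/df of log(1 + exp(-y f))
        grad_W = ((g[:, None] * (pre > 0)) * a[None, :]).T @ x / n
        W -= lr * grad_W
    return W_init, W, a


def mean_alignment(W, d, signal=4.0):
    """Average over neurons of max |cosine| with the mu1 and mu2 directions."""
    mu1, mu2 = cluster_means(d, signal)
    U = np.stack([mu1, mu2]) / signal
    cos = np.abs(W @ U.T) / np.linalg.norm(W, axis=1, keepdims=True)
    return cos.max(axis=1).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 20
    x, y, _ = make_xor_data(400, d, noise_rate=0.1, rng=rng)
    x_test, _, y_test = make_xor_data(2000, d, noise_rate=0.0, rng=rng)

    W_init, W, a = train_two_layer_relu(x, y, rng)
    net_acc = np.mean(np.sign(forward(W, a, x_test)) == y_test)

    # Linear baseline without a bias term, fitted by least squares.
    w_lin = np.linalg.lstsq(x, y, rcond=None)[0]
    lin_acc = np.mean(np.sign(x_test @ w_lin) == y_test)

    print("network test accuracy:", net_acc)
    print("linear test accuracy:", lin_acc)
    print("alignment at init:", mean_alignment(W_init, d))
    print("alignment after training:", mean_alignment(W, d))
```

In this toy construction, the linear baseline sits near chance because a bias-free linear classifier treats the symmetric clusters at +mu and -mu oppositely, while comparing `mean_alignment` before and after training gives a rough, informal view of weakly aligned random neurons becoming strongly aligned with the cluster directions.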