The rectified linear unit (ReLU), as a non-linear activation function, is well known to improve the expressivity of neural networks, such that any continuous function can be approximated to arbitrary precision by a sufficiently wide neural network. In this work, we present another interesting and important feature of the ReLU activation function. We show that ReLU leads to {\it better separation} of similar data and {\it better conditioning} of the neural tangent kernel (NTK), two properties that are closely related. Compared with linear neural networks, we show that a ReLU-activated wide neural network at random initialization has a larger angle separation for similar data in the feature space of the model gradient, and a smaller NTK condition number. Note that, for a linear neural network, the data separation and the NTK condition number always remain the same as for a linear model. Furthermore, we show that a deeper ReLU network (i.e., one with more ReLU activation operations) has a smaller NTK condition number than a shallower one. Our results imply that ReLU activation, as well as the depth of a ReLU network, helps improve the convergence rate of gradient descent, which is closely related to the NTK condition number.
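The two claims about angle separation and NTK conditioning can be checked numerically on a toy case. The following is a minimal sketch, not the authors' experimental setup: it assumes a two-layer network f(x) = (1/sqrt(m)) * sum_j a_j * act(w_j . x) with standard Gaussian weights, two unit-norm inputs at a small angle `theta`, and a finite width, so the numbers fluctuate with the seed. All function names and default parameters (`width`, `dim`, `theta`, `seed`) are hypothetical choices made for illustration.

```python
import math
import random


def gradient_features(x, hidden, outer, use_relu):
    """Gradient of f(x) = (1/sqrt(m)) * sum_j a_j * act(w_j . x) w.r.t. all weights."""
    scale = 1.0 / math.sqrt(len(hidden))
    feats = []
    for w, a in zip(hidden, outer):
        pre = sum(wi * xi for wi, xi in zip(w, x))
        if use_relu:
            act, slope = max(pre, 0.0), (1.0 if pre > 0.0 else 0.0)
        else:
            act, slope = pre, 1.0  # identity activation: linear network
        feats.extend(scale * a * slope * xi for xi in x)  # df/dw_j
        feats.append(scale * act)                         # df/da_j
    return feats


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def angle_and_condition(x, z, hidden, outer, use_relu):
    """Angle between gradient features and condition number of the 2x2 NTK Gram matrix."""
    gx = gradient_features(x, hidden, outer, use_relu)
    gz = gradient_features(z, hidden, outer, use_relu)
    k11, k12, k22 = dot(gx, gx), dot(gx, gz), dot(gz, gz)
    cos_angle = max(-1.0, min(1.0, k12 / math.sqrt(k11 * k22)))
    # Closed-form eigenvalues of the symmetric matrix [[k11, k12], [k12, k22]].
    mean = 0.5 * (k11 + k22)
    gap = math.sqrt(0.25 * (k11 - k22) ** 2 + k12 ** 2)
    return math.acos(cos_angle), (mean + gap) / (mean - gap)


def compare(width=4000, dim=5, theta=0.1, seed=0):
    """Compare a linear and a ReLU network that share the same random weights."""
    rng = random.Random(seed)
    hidden = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(width)]
    outer = [rng.gauss(0.0, 1.0) for _ in range(width)]
    # Two unit-norm inputs separated by the angle theta (assumed "similar data").
    x = [1.0] + [0.0] * (dim - 1)
    z = [math.cos(theta), math.sin(theta)] + [0.0] * (dim - 2)
    lin_angle, lin_cond = angle_and_condition(x, z, hidden, outer, False)
    relu_angle, relu_cond = angle_and_condition(x, z, hidden, outer, True)
    return {
        "input_angle": theta,
        "linear_angle": lin_angle,
        "relu_angle": relu_angle,
        "linear_cond": lin_cond,
        "relu_cond": relu_cond,
    }


if __name__ == "__main__":
    for name, value in compare().items():
        print(f"{name}: {value:.4f}")
```

Under these assumptions, the ReLU network should report a larger gradient-feature angle and a smaller condition number than the linear network, while the linear network's angle should stay close to the input angle. Increasing `width` reduces the seed-to-seed fluctuation.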