We study the implicit bias of the general family of steepest descent algorithms, including gradient descent, sign descent, and coordinate descent, in deep homogeneous neural networks. We prove that an algorithm-dependent geometric margin begins to increase once the networks reach perfect training accuracy, and we characterize the late-stage bias of these algorithms. In particular, we define a generalized notion of stationarity for optimization problems and show that the algorithms progressively reduce a (generalized) Bregman divergence, which quantifies proximity to such stationary points of a margin-maximization problem. We then experimentally zoom into the trajectories of neural networks optimized with various steepest descent algorithms, highlighting connections to the implicit bias of Adam.
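For concreteness, the sketch below (NumPy, illustrative only) shows one common way to instantiate the steepest descent family with respect to the ℓ2, ℓ∞, and ℓ1 norms, which recovers gradient descent, sign descent, and greedy coordinate descent respectively. The `steepest_descent_step` helper and its normalization are assumptions made here for illustration and need not match the paper's exact update rule.

```python
# Minimal sketch of the steepest descent family, assuming the standard choice
#   Δw ∈ argmin_v <∇L(w), v> + (1/2)||v||^2   with respect to a norm ||·||.
# ℓ2 gives gradient descent, ℓ∞ gives sign descent, ℓ1 gives coordinate descent.
import numpy as np

def steepest_descent_step(w, grad, lr, norm="l2"):
    """One steepest descent update of parameters `w` given gradient `grad`."""
    if norm == "l2":        # gradient descent: Δw = -grad
        step = -grad
    elif norm == "linf":    # sign descent: Δw = -||grad||_1 * sign(grad)
        step = -np.sum(np.abs(grad)) * np.sign(grad)
    elif norm == "l1":      # coordinate descent: move only the coordinate
        step = np.zeros_like(grad)          # with the largest |gradient| entry
        i = int(np.argmax(np.abs(grad)))
        step[i] = -np.abs(grad[i]) * np.sign(grad[i])
    else:
        raise ValueError(f"unknown norm: {norm}")
    return w + lr * step

# Toy usage: minimize a simple quadratic with each member of the family.
if __name__ == "__main__":
    A = np.diag([1.0, 5.0])
    loss = lambda w: 0.5 * w @ A @ w
    grad = lambda w: A @ w
    for norm in ("l2", "linf", "l1"):
        w = np.array([3.0, -2.0])
        for _ in range(200):
            w = steepest_descent_step(w, grad(w), lr=0.05, norm=norm)
        print(norm, loss(w))
```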