We study the implicit bias of the general family of steepest descent algorithms with infinitesimal learning rate in deep homogeneous neural networks. We show that: (a) an algorithm-dependent geometric margin starts increasing once the networks reach perfect training accuracy, and (b) any limit point of the training trajectory corresponds to a KKT point of the corresponding margin-maximization problem. We experimentally zoom in on the trajectories of neural networks optimized with various steepest descent algorithms, highlighting connections to the implicit bias of popular adaptive methods (Adam and Shampoo).
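To make the family of algorithms concrete, the sketch below illustrates a single steepest descent step with respect to a few choices of norm: the Euclidean norm recovers plain gradient descent, the infinity norm yields sign descent (closely related to Adam without momentum), and the spectral norm on matrix parameters relates to Shampoo-style preconditioning. This is a minimal illustration under the dual-norm scaling convention, not the paper's experimental setup; the function name and scaling choice are assumptions.

```python
import numpy as np

def steepest_descent_step(w, grad, lr, norm="l2"):
    """One steepest descent step with respect to the chosen norm.

    Update: w <- w - lr * ||grad||_* * argmax_{||v|| = 1} <v, grad>,
    where ||.||_* denotes the dual norm. (Illustrative sketch only.)
    """
    if norm == "l2":
        # Dual of l2 is l2; the maximizing direction is grad / ||grad||_2,
        # so the step reduces to plain gradient descent.
        return w - lr * grad
    if norm == "linf":
        # Dual of l_inf is l1; the maximizing direction is sign(grad),
        # giving sign descent (Adam without momentum behaves similarly).
        return w - lr * np.sum(np.abs(grad)) * np.sign(grad)
    if norm == "spectral":
        # For matrix parameters, the maximizer over the spectral-norm ball
        # is U V^T from the SVD of grad; the dual norm is the nuclear norm.
        u, s, vt = np.linalg.svd(grad, full_matrices=False)
        return w - lr * s.sum() * (u @ vt)
    raise ValueError(f"unknown norm: {norm}")
```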