This paper proposes a theoretical framework to evaluate and compare the performance of gradient-descent algorithms for distributed learning in relation to their behavior around local minima in nonconvex environments. Previous works have observed that convergence toward flat local minima tends to enhance the generalization ability of learning algorithms. This work establishes two results. First, it shows that decentralized learning strategies are able to escape from local minimizers faster and favor convergence toward flatter minima relative to the centralized solution in the large-batch training regime. Second, and importantly, the ultimate classification accuracy does not depend solely on the flatness of the local minimizer but also on how well a learning algorithm can approach that minimizer. In other words, the classification accuracy is a function of both flatness and optimization performance. The paper closely examines the interplay between these two measures, flatness and optimization error. One important conclusion is that decentralized strategies of the diffusion type deliver enhanced classification accuracy because they strike a more favorable balance between flatness and optimization performance.
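To make the comparison concrete, the sketch below illustrates the general form of a diffusion-type (adapt-then-combine) decentralized strategy against a centralized gradient-descent baseline. This is a minimal toy example, not the paper's experimental setup: the quadratic local losses, the ring topology, the combination matrix, and all variable names (`local_grad`, `C`, `mu`, etc.) are illustrative assumptions.

```python
import numpy as np

# Toy setup: K agents, each with a local quadratic loss f_k(w) = 0.5 * ||A_k w - b_k||^2.
# Hypothetical illustration of the adapt-then-combine (ATC) diffusion strategy,
# compared against a centralized gradient-descent baseline on the aggregate loss.

rng = np.random.default_rng(0)
K, d = 5, 3                      # number of agents, parameter dimension
A = [rng.normal(size=(10, d)) for _ in range(K)]
b = [rng.normal(size=10) for _ in range(K)]

def local_grad(k, w):
    """Gradient of agent k's local loss at w."""
    return A[k].T @ (A[k] @ w - b[k])

# Doubly stochastic combination matrix over a ring topology (each agent
# averages with its two neighbors and itself).
C = np.zeros((K, K))
for k in range(K):
    C[k, k] = C[k, (k - 1) % K] = C[k, (k + 1) % K] = 1.0 / 3.0

mu, T = 0.01, 500                # step size and number of iterations

# Diffusion (ATC): each agent adapts using its local gradient, then combines
# the intermediate iterates received from its neighbors.
W = np.zeros((K, d))             # one row per agent
for _ in range(T):
    psi = np.array([W[k] - mu * local_grad(k, W[k]) for k in range(K)])  # adapt
    W = C @ psi                                                          # combine

# Centralized baseline: a single iterate driven by the average of all local gradients.
w_cent = np.zeros(d)
for _ in range(T):
    w_cent -= mu * np.mean([local_grad(k, w_cent) for k in range(K)], axis=0)

print("diffusion (agent average):", W.mean(axis=0))
print("centralized:              ", w_cent)
```

On this convex toy problem both schemes reach the same minimizer; the paper's analysis concerns how the two updates behave differently around local minima of nonconvex losses, where the diffusion dynamics influence both the flatness of the minima reached and the optimization error.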