The energy landscape of high-dimensional non-convex optimization problems is crucial to understanding the effectiveness of modern deep neural network architectures. Recent works have shown experimentally that two different solutions found by two runs of stochastic training are often connected by very simple continuous paths (e.g., linear), modulo a permutation of the weights. In this paper, we provide a framework that explains this empirical observation theoretically. Based on convergence rates in Wasserstein distance of empirical measures, we show that, with high probability, two wide enough two-layer neural networks trained with stochastic gradient descent are linearly connected. Additionally, we give upper and lower bounds on the width of each layer required for two deep neural networks with independent neuron weights to be linearly connected. Finally, we empirically demonstrate the validity of our approach by showing how the dimension of the support of the weight distribution of neurons, which dictates Wasserstein convergence rates, is correlated with linear mode connectivity.
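To make the notion of linear connectivity modulo a permutation concrete, the following minimal sketch builds two tiny two-layer ReLU networks that compute the same function with differently ordered hidden neurons. It compares the midpoint of the naive linear path with the midpoint obtained after the neurons are aligned. The weights, the helper names (`forward`, `permute`, `interpolate`, `best_permutation`), and the brute-force matching are all illustrative assumptions; in particular, the brute-force search is a toy stand-in and not the Wasserstein-based argument used in the paper.

```python
import itertools

def forward(net, x):
    """Two-layer ReLU network: f(x) = sum_j a[j] * relu(W[j] . x)."""
    W, a = net
    return sum(a_j * max(0.0, sum(w * xi for w, xi in zip(w_j, x)))
               for w_j, a_j in zip(W, a))

def permute(net, perm):
    """Reorder the hidden neurons; the computed function is unchanged."""
    W, a = net
    return [W[p] for p in perm], [a[p] for p in perm]

def interpolate(net0, net1, t):
    """Point (1 - t) * net0 + t * net1 on the linear path in weight space."""
    (W0, a0), (W1, a1) = net0, net1
    W = [[(1 - t) * u + t * v for u, v in zip(r0, r1)]
         for r0, r1 in zip(W0, W1)]
    a = [(1 - t) * u + t * v for u, v in zip(a0, a1)]
    return W, a

def weight_distance(net0, net1):
    """Squared Euclidean distance between neuron-wise weights."""
    (W0, a0), (W1, a1) = net0, net1
    d = sum((u - v) ** 2 for r0, r1 in zip(W0, W1) for u, v in zip(r0, r1))
    return d + sum((u - v) ** 2 for u, v in zip(a0, a1))

def best_permutation(net_ref, net):
    """Brute-force search (only feasible for tiny widths) for the neuron
    ordering of `net` that is closest to `net_ref` in weight space."""
    width = len(net[1])
    return min(itertools.permutations(range(width)),
               key=lambda p: weight_distance(net_ref, permute(net, p)))

# Hypothetical weights: net_b is net_a with its hidden neurons shuffled,
# a toy stand-in for a second training run that found the same function.
net_a = ([[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]], [1.0, -2.0, 0.5])
net_b = permute(net_a, [2, 0, 1])

x = [1.0, 2.0]
f_a = forward(net_a, x)
f_b = forward(net_b, x)

# Midpoint of the naive linear path versus the path after alignment.
naive_mid = forward(interpolate(net_a, net_b, 0.5), x)
perm = best_permutation(net_a, net_b)
aligned_mid = forward(interpolate(net_a, permute(net_b, perm), 0.5), x)

print("f_a, f_b:", f_a, f_b)
print("naive midpoint output:", naive_mid)
print("aligned midpoint output:", aligned_mid)
```

In this toy setting the two networks agree as functions, yet the naive midpoint computes something different, while the aligned midpoint reproduces the original output. For realistic widths, exhaustive search over permutations is infeasible, and an assignment solver or an optimal-transport formulation would be the natural replacement.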