Empirical studies of the loss landscape of deep networks have revealed that many local minima are connected through low-loss valleys, yet little is known about the theoretical origin of such valleys. We present a general framework for finding continuous symmetries in the parameter space, which carve out low-loss valleys. Our framework uses equivariances of the activation functions and can be applied to different layer architectures. To generalize this framework to nonlinear neural networks, we introduce a novel set of nonlinear, data-dependent symmetries. These symmetries can transform a trained model such that it performs similarly on new samples, which allows ensemble building that improves robustness under certain adversarial attacks. We then show that conserved quantities associated with linear symmetries can be used to define coordinates along low-loss valleys. The conserved quantities help reveal that, under common initialization methods, gradient flow explores only a small part of the global minimum. By relating conserved quantities to convergence rate and sharpness of the minimum, we provide insights into how initialization impacts convergence and generalizability.
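
To make the notions of a continuous parameter-space symmetry and its conserved quantity concrete, the following minimal sketch works through the simplest linear case. It is an illustration under stated assumptions, not the framework itself: the two-layer linear model `U @ V @ X`, the squared loss, the names `loss`, `grads`, and `conserved`, and all sizes and step sizes are chosen here for the example. For this model, the map (U, V) -> (U g^{-1}, g V) leaves the loss unchanged for any invertible g, and Q = U^T U - V V^T stays constant under gradient flow; small-step gradient descent is used below as a stand-in for gradient flow, so Q is only approximately conserved.

```python
import numpy as np


def loss(U, V, X, Y):
    """Squared loss of a two-layer linear network (assumed toy model)."""
    R = Y - U @ V @ X
    return 0.5 * np.sum(R ** 2)


def grads(U, V, X, Y):
    """Gradients of the loss with respect to U and V."""
    R = Y - U @ V @ X
    gU = -R @ (V @ X).T
    gV = -U.T @ R @ X.T
    return gU, gV


def conserved(U, V):
    """Q = U^T U - V V^T, constant along gradient flow for this model."""
    return U.T @ U - V @ V.T


rng = np.random.default_rng(0)
n_in, n_hid, n_out, n_samples = 3, 4, 2, 5  # illustrative sizes
X = rng.normal(size=(n_in, n_samples))
Y = rng.normal(size=(n_out, n_samples))
U = rng.normal(size=(n_out, n_hid))
V = rng.normal(size=(n_hid, n_in))

# Symmetry: (U, V) -> (U g^{-1}, g V) leaves the product U V, and hence
# the loss, unchanged. The shift by 3*I just keeps g well conditioned.
g = rng.normal(size=(n_hid, n_hid)) + 3.0 * np.eye(n_hid)
U_sym = U @ np.linalg.inv(g)
V_sym = g @ V
loss_gap = abs(loss(U, V, X, Y) - loss(U_sym, V_sym, X, Y))

# Conserved quantity: follow small-step gradient descent (an approximation
# of gradient flow) and measure how far Q drifts from its initial value.
Q0 = conserved(U, V)
Ut, Vt = U.copy(), V.copy()
lr = 1e-5
for _ in range(2000):
    gU, gV = grads(Ut, Vt, X, Y)
    Ut -= lr * gU
    Vt -= lr * gV

q_drift = np.max(np.abs(conserved(Ut, Vt) - Q0))
loss_drop = loss(U, V, X, Y) - loss(Ut, Vt, X, Y)

print(f"loss gap under symmetry: {loss_gap:.2e}")
print(f"loss decrease during training: {loss_drop:.3f}")
print(f"max drift of Q during training: {q_drift:.2e}")
```

In this sketch the loss decreases while Q barely moves, which is the sense in which a conserved quantity fixes where along a symmetry orbit (a low-loss valley) training ends up: the value of Q is set at initialization. The nonlinear, data-dependent symmetries mentioned above are not covered by this linear example.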