Neural network training is commonly based on stochastic gradient descent (SGD). However, given the non-convex nature of loss functions and the intricate geometry of loss landscapes, our understanding of SGD's ability to converge to good local minima remains limited. In this paper, we apply topological data analysis to loss landscapes to gain insight into the learning process and generalization properties of deep neural networks. We use the topology of the loss function to relate the local behavior of gradient descent trajectories to the global properties of the loss surface. To this end, we define the neural network's Topological Obstructions score ("TO-score") with the help of robust topological invariants, the barcodes of the loss function, which quantify how easily gradient-based optimization can escape local minima. Our two principal observations are: 1) the loss barcode of a neural network shrinks with increasing depth and width, so the topological obstructions to learning diminish; 2) in certain situations, the lengths of the minima's segments in the loss barcode are connected to the minima's generalization errors. Our claims are supported by extensive experiments with fully connected, convolutional, and transformer architectures on several datasets, including MNIST, FMNIST, CIFAR10, CIFAR100, SVHN, and the multilingual OSCAR text dataset.
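The abstract refers to barcodes of the loss function without spelling out how they are computed. As a rough illustration only, the sketch below computes the 0-dimensional sublevel-set barcode of a sampled 1D loss profile via a standard union-find merge procedure (the elder rule pairs each non-global local minimum with the saddle at which its basin merges into an older one), and then aggregates the finite bar lengths into an obstruction score. The names `sublevel_barcode_1d` and `to_score`, and the sum-of-bar-lengths aggregation, are hypothetical simplifications and not the paper's actual TO-score definition, which concerns high-dimensional loss landscapes.

```python
import numpy as np

def sublevel_barcode_1d(loss):
    """0-dimensional persistence barcode of a sampled 1D loss profile
    under the sublevel-set filtration. Each finite bar (birth, death)
    pairs a local minimum (birth value) with the saddle value (death)
    at which its basin merges into an older basin; the global minimum
    yields one essential bar with death = inf."""
    n = len(loss)
    order = np.argsort(loss)        # add points in order of increasing loss
    parent = np.full(n, -1)         # -1 marks points not yet in the filtration
    birth = {}                      # component root -> birth (minimum) value
    bars = []

    def find(i):
        # union-find root lookup with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in order:
        parent[i] = i
        birth[i] = loss[i]          # a new component is born at a local minimum
        for j in (i - 1, i + 1):    # the two neighbors on the 1D grid
            if 0 <= j < n and parent[j] != -1:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # elder rule: the younger component (larger birth) dies here
                    young, old = (ri, rj) if birth[ri] >= birth[rj] else (rj, ri)
                    bars.append((birth[young], loss[i]))
                    parent[young] = old
                    birth.pop(young)
    root = find(order[0])
    bars.append((birth[root], np.inf))  # essential bar of the global minimum
    return bars

def to_score(bars):
    """Illustrative obstruction score (an assumption, not the paper's
    definition): total length of the finite bars, i.e. the summed depths
    of non-global minima relative to their merging saddles."""
    return sum(d - b for b, d in bars if np.isfinite(d))

# Example: a loss profile with two local minima separated by a barrier.
xs = np.linspace(-2.0, 2.0, 401)
loss = (xs**2 - 1.0)**2 + 0.3 * xs
print(to_score(sublevel_barcode_1d(loss)))
```

On this toy profile the score is the depth of the shallower minimum below its merging saddle; deeper or more numerous non-global minima would increase it, matching the intuition that longer bars mean stronger obstructions to gradient-based escape.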