To achieve near-zero training error in a classification problem, the layers of a deep network have to disentangle the manifolds of data points with different labels to facilitate discrimination. However, excessive class separation can lead to overfitting, since good generalisation requires learning invariant features, which involve some level of entanglement. We report on numerical experiments showing how the optimisation dynamics finds representations that balance these opposing tendencies, following a non-monotonic trend. After a fast segregation phase, a slower rearrangement (conserved across data sets and architectures) increases the class entanglement. The training error at the inversion is remarkably stable under subsampling and across network initialisations and optimisers, which characterises it as a property almost solely of the data structure, depending only very weakly on the architecture. The inversion is the manifestation of tradeoffs elicited by well-defined and maximally stable elements of the training set, which we call "stragglers" and which are particularly influential for generalisation.