Second-order optimization has been shown to accelerate the training of deep neural networks in many applications, often yielding faster progress per iteration on the training loss compared to first-order optimizers. However, the generalization properties of second-order methods are still being debated. Theoretical investigations have proved difficult to carry out outside the tractable settings of heavily simplified model classes -- thus, the relevance of existing theories to practical deep learning applications remains unclear. Similarly, empirical studies in large-scale models and real datasets are significantly confounded by the necessity to approximate second-order updates in practice. It is often unclear whether the observed generalization behaviour arises specifically from the second-order nature of the parameter updates, or instead reflects the specific structured (e.g.\ Kronecker) approximations used or any damping-based interpolation towards first-order updates. Here, we show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep reversible architectures that are sufficiently expressive to be meaningfully applied to common benchmark datasets. We exploit this novel setting to study the training and generalization properties of the GN optimizer. We find that exact GN generalizes poorly. In the mini-batch training setting, this manifests as rapidly saturating progress even on the \emph{training} loss, with parameter updates found to overfit each mini-batch without producing the features that would support generalization to other mini-batches. We show that our experiments run in the ``lazy'' regime, in which the neural tangent kernel (NTK) changes very little during the course of training. This behaviour is associated with having no significant changes in neural representations, explaining the lack of generalization.
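To make the notion of an exact Gauss-Newton update concrete, the following is a minimal sketch on a toy one-parameter least-squares problem, not the paper's reversible-architecture setting: the step solves $(J^\top J)\,\delta w = J^\top r$ for residuals $r$ and Jacobian $J$, with no damping or structured approximation. All names (`w`, `xs`, `ys`) are illustrative.

```python
import numpy as np

# Toy model: y ≈ tanh(w * x), fit by exact Gauss-Newton (no damping,
# no Kronecker-style approximation). Illustrative only.

def residuals(w, xs, ys):
    return np.tanh(xs * w) - ys

def jacobian(w, xs):
    # d/dw tanh(x * w) = x * (1 - tanh(x * w)^2)
    return (xs * (1.0 - np.tanh(xs * w) ** 2)).reshape(-1, 1)

def gauss_newton_step(w, xs, ys):
    r = residuals(w, xs, ys)
    J = jacobian(w, xs)
    # Exact GN: solve (J^T J) dw = J^T r, then update w <- w - dw
    dw = np.linalg.solve(J.T @ J, J.T @ r)
    return w - dw.item()

xs = np.array([0.5, 1.0, 1.5, 2.0])
ys = np.tanh(xs * 0.8)  # data generated with true parameter w = 0.8
w = 0.1
for _ in range(10):
    w = gauss_newton_step(w, xs, ys)
print(round(w, 3))  # converges to the true parameter, ~0.8
```

In deep networks, $J^\top J$ is generally intractable to form and invert exactly, which is why practical second-order methods resort to the structured approximations and damping that the abstract identifies as confounders.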