This work provides a comprehensive analysis of the generalization properties of Neural Operators (NOs) and their derived architectures. Through empirical evaluation of the test loss, analysis of complexity-based generalization bounds, and qualitative assessment of loss-landscape visualizations, we investigate modifications aimed at enhancing the generalization capabilities of NOs. Inspired by the success of Transformers, we propose ${\textit{s}}{\text{NO}}+\varepsilon$, which uses a kernel integral operator in lieu of self-attention. Our results reveal significantly improved performance across datasets and initializations, accompanied by qualitative changes in the visualization of the loss landscape. We conjecture that the layout of Transformers enables the optimization algorithm to find better minima, and that stochastic depth improves generalization performance. Because a rigorous analysis of training dynamics remains one of the most prominent unsolved problems in deep learning, we focus exclusively on the complexity-based generalization analysis of the architectures. Building on statistical theory, and in particular Dudley's theorem, we derive upper bounds on the Rademacher complexity of NOs and of ${\textit{s}}{\text{NO}}+\varepsilon$. For the latter, our bounds do not rely on norm control of the parameters. This makes them applicable to networks of any depth, as long as the random variables in the architecture follow a decay law, which connects stochastic depth with generalization, as we conjectured. In contrast, the bounds for NOs rely solely on norm control of the parameters and exhibit an exponential dependence on depth. Furthermore, our experiments demonstrate that the proposed network generalizes remarkably well under perturbations of the data distribution, whereas NOs perform poorly in out-of-distribution scenarios.
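For orientation, the two tools named in the abstract can be written in their standard textbook form. This is only a sketch: the notation (a function class $\mathcal{F}$, a sample $S = (x_1, \dots, x_n)$, Rademacher signs $\sigma_i$, covering number $N$, universal constant $C$) is assumed here for illustration and is not taken from the paper, whose exact statements and constants may differ. The empirical Rademacher complexity of $\mathcal{F}$ on $S$ is

$$\widehat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i)\right],$$

and Dudley's entropy integral bounds it through the covering numbers of $\mathcal{F}$ in the empirical $L_2$ norm:

$$\widehat{\mathfrak{R}}_S(\mathcal{F}) \le C \inf_{\alpha > 0}\left(\alpha + \frac{1}{\sqrt{n}} \int_{\alpha}^{\sup_{f \in \mathcal{F}} \|f\|_{L_2(S)}} \sqrt{\log N\left(\mathcal{F}, \epsilon, \|\cdot\|_{L_2(S)}\right)}\, d\epsilon\right).$$

In this form, a bound on the covering number of an architecture translates directly into a bound on its Rademacher complexity, which is the general route from Dudley's theorem to complexity-based generalization bounds.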