Gradient-based optimization of neural differential equations and other parameterized dynamical systems fundamentally relies on the ability to differentiate numerical solutions with respect to model parameters. In stiff systems, it has been observed that sensitivities to parameters controlling fast-decaying modes become vanishingly small during training, leading to optimization difficulties. In this paper, we show that this vanishing-gradient phenomenon is not an artifact of any particular method, but a universal feature of all A-stable and L-stable stiff numerical integration schemes. We analyze the rational stability function of general stiff integration schemes and demonstrate that the relevant parameter sensitivities, which are governed by the derivative of the stability function, decay to zero as the stiffness grows. We provide explicit formulas for common stiff integration schemes that illustrate the mechanism in detail. Finally, we rigorously prove that the slowest possible rate of decay of the derivative of the stability function is $O(|z|^{-1})$, revealing a fundamental limitation: all A-stable time-stepping methods inevitably suppress parameter gradients in stiff regimes, posing a significant barrier to training and parameter identification in stiff neural ODEs.
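As a minimal worked illustration of the mechanism summarized above (a sketch using the standard Dahlquist test problem and the backward Euler scheme, chosen here for concreteness; the paper's analysis covers general A-stable and L-stable schemes): for $u' = \lambda u$ with step size $h$ and $z = h\lambda$, backward Euler advances $u_{n+1} = R(z)\,u_n$ with stability function $R(z) = 1/(1-z)$, so the one-step sensitivity with respect to $\lambda$ (holding $u_n$ fixed) is
\[
\frac{\partial u_{n+1}}{\partial \lambda} = h\,R'(z)\,u_n = \frac{h\,u_n}{(1-z)^2},
\]
which vanishes as $z \to -\infty$; the gradient signal carried by the fast-decaying mode is suppressed in exactly the stiff regime.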