Massively parallel hardware (GPUs) and long sequence data have made parallel algorithms essential for machine learning at scale. Yet dynamical systems, like recurrent neural networks and Markov chain Monte Carlo, were thought to suffer from sequential bottlenecks. Recent work showed that dynamical systems can in fact be parallelized across the sequence length by reframing their evaluation as a system of nonlinear equations, which can be solved with Newton's method using a parallel associative scan. However, these parallel Newton methods suffered from serious limitations: inefficiency, instability, and a lack of convergence guarantees. This thesis addresses these limitations with methodological and theoretical contributions, drawing especially on ideas from optimization. Methodologically, we develop scalable and stable parallel Newton methods based on quasi-Newton and trust-region approaches. The quasi-Newton methods are faster and more memory efficient, while the trust-region approaches are significantly more stable. Theoretically, we unify many fixed-point methods, including Picard and Jacobi iterations, into our parallel Newton framework. We establish a linear convergence rate for these techniques that depends on the method's approximation accuracy and stability. Moreover, we give a precise condition, rooted in dynamical stability, that characterizes when parallelization provably accelerates a dynamical system and when it cannot. Specifically, the sign of the largest Lyapunov exponent of a dynamical system determines whether or not parallel Newton methods converge quickly. In sum, this thesis unlocks scalable and stable methods for parallelizing sequential computation, and provides a firm theoretical basis for when such techniques will and will not work. This thesis also serves as a guide to parallel Newton methods for researchers who want to write the next chapter in this ongoing story.
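To make the core idea concrete, the following is a minimal illustrative sketch (not the thesis implementation): a scalar recurrence x[t] = f(x[t-1]) is reframed as the stacked nonlinear system x[t] - f(x[t-1]) = 0 and solved with Newton's method. Each Newton step reduces to a *linear* recurrence in the update delta, which is associative and could therefore be evaluated with a parallel scan; a plain loop stands in for that scan here. The map f, the horizon T, and the iteration budget are all assumptions chosen for illustration.

```python
import numpy as np

def f(x):
    return 0.5 * np.tanh(x)       # contractive map: largest Lyapunov exponent < 0

def df(x):
    return 0.5 / np.cosh(x) ** 2  # derivative of f

T, x0 = 100, 1.0

# Reference: plain sequential evaluation, T dependent steps.
seq = np.empty(T + 1)
seq[0] = x0
for t in range(T):
    seq[t + 1] = f(seq[t])

# Parallel Newton: start from an arbitrary guess for the whole trajectory
# and refine all timesteps jointly.
x = np.zeros(T + 1)
x[0] = x0
for _ in range(20):
    r = x[1:] - f(x[:-1])   # residuals, computable in parallel across t
    a = df(x[:-1])          # Jacobian subdiagonal, also parallel across t
    if np.max(np.abs(r)) < 1e-12:
        break
    # Newton step: delta[t] = a[t-1] * delta[t-1] - r[t-1], delta[0] = 0.
    # This linear recurrence is exactly what a parallel associative scan
    # evaluates in O(log T) depth; a sequential loop stands in for it here.
    delta = np.zeros(T + 1)
    for t in range(1, T + 1):
        delta[t] = a[t - 1] * delta[t - 1] - r[t - 1]
    x = x + delta

print(np.allclose(seq, x))  # True: Newton recovers the sequential trajectory
```

Dropping the Jacobian term (setting a = 0 in the recurrence) recovers the Picard iteration mentioned above, which is one instance of how simpler fixed-point methods fit inside the Newton framework. Because this f is contractive (negative largest Lyapunov exponent), the iteration converges in far fewer than T sweeps; an expansive map would not enjoy this speedup.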