An old idea in optimization theory says that since the gradient is a dual vector it may not be subtracted from the weights without first being mapped to the primal space where the weights reside. We take this idea seriously in this paper and construct such a duality map for general neural networks. Our map, which we call modular dualization, forms a unifying theoretical basis for training algorithms that are a) fast and b) scalable. Modular dualization involves first assigning operator norms to layers based on the semantics of each layer, and then using these layerwise norms to recursively induce a duality map on the weight space of the full neural architecture. We conclude by deriving GPU-friendly algorithms for dualizing Embed, Linear and Conv2D layers -- the latter two methods are based on a rectangular Newton-Schulz iteration (Kovarik, 1970; Björck & Bowie, 1971). A variant of our methods was used to set speed records for training NanoGPT. Overall, we hope that our theory of modular duality will yield a next generation of fast and scalable optimizers for general neural architectures.
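To illustrate the kind of iteration the abstract refers to, here is a minimal NumPy sketch of a cubic Newton-Schulz iteration that approximates the semi-orthogonal factor of a rectangular matrix (i.e. replaces the singular values of the gradient with ones, which is the duality map under the spectral norm). The function name, step count, and normalization choice are illustrative assumptions, not the paper's exact method, which may use tuned polynomial coefficients.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30):
    """Approximate the semi-orthogonal factor U V^T of G = U S V^T
    using the cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.
    Handles rectangular matrices; the Frobenius normalization below
    keeps the starting singular values in (0, 1], inside the
    iteration's basin of convergence (0, sqrt(3))."""
    X = G / (np.linalg.norm(G) + 1e-12)  # Frobenius norm upper-bounds the spectral norm
    transpose = X.shape[0] > X.shape[1]
    if transpose:
        X = X.T  # iterate in the wide orientation so X @ X.T is the smaller product
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X.T if transpose else X
```

The iteration uses only matrix multiplications, which is what makes it GPU-friendly: unlike an SVD, every step maps onto dense matmul hardware.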