Natural gradient descent has a remarkable property: in the small learning rate limit, it displays an invariance with respect to network reparameterizations, leading to robust training behavior even for highly covariant network parameterizations. We show that optimization algorithms with this property can be viewed as discrete approximations of natural transformations from the functor determining an optimizer's state space from the diffeomorphism group of its configuration manifold, to the functor determining that state space's tangent bundle from this group. Algorithms with this property enjoy greater efficiency when used to train poorly parameterized networks, as the network evolution they generate is approximately invariant to network reparameterizations. More specifically, the flow generated by these algorithms in the limit as the learning rate vanishes is invariant under smooth reparameterizations, the respective flows of the parameters being determined by equivariant maps. By casting this property as a natural transformation, we allow for generalizations beyond equivariance with respect to group actions; this framework can account for non-invertible maps such as projections, providing a basis for the direct comparison of training behavior across non-isomorphic network architectures, and for the formal examination of limiting behavior as network size increases by considering inverse limits of these projections, should they exist. We present a simple method of introducing this naturality more generally and examine a number of popular machine learning training algorithms, finding that most are unnatural.
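The invariance claimed for the vanishing learning rate limit can be illustrated numerically. The following is a minimal sketch, not the construction used in this work: the loss gradient `loss_grad`, the metric `metric`, and the reparameterization `psi` are hypothetical choices made only for illustration. It takes one update step directly in the coordinates θ and one in the reparameterized coordinates φ (with θ = ψ(φ)), then measures how far apart the two resulting networks are. For the natural gradient step the gap should shrink like the square of the learning rate, so the two flows agree in the limit; for plain gradient descent it should shrink only linearly, so the flows differ.

```python
import numpy as np

# All functions below are illustrative assumptions, not taken from the source.

def loss_grad(theta):
    # Gradient of a hypothetical loss L(theta) = 0.5 * ||theta - target||^2.
    return theta - np.array([1.0, -2.0])

def metric(theta):
    # Hypothetical symmetric positive-definite metric G(theta), standing in
    # for a Fisher information matrix.
    return np.array([[2.0 + theta[0] ** 2, 0.5],
                     [0.5, 1.0 + theta[1] ** 2]])

def psi(phi):
    # Hypothetical smooth reparameterization theta = psi(phi), elementwise.
    return phi + phi ** 3

def psi_jac(phi):
    # Jacobian of psi; diagonal because psi acts elementwise.
    return np.diag(1.0 + 3.0 * phi ** 2)

def step_discrepancy(eta, natural):
    """Distance between one step taken in theta and one taken in phi."""
    phi = np.array([0.3, -0.4])
    theta = psi(phi)
    g = loss_grad(theta)
    G = metric(theta)
    J = psi_jac(phi)

    # Step computed directly in theta coordinates.
    d_theta = np.linalg.solve(G, g) if natural else g
    theta_direct = theta - eta * d_theta

    # Same step computed in phi coordinates: the gradient pulls back as
    # J^T g and the metric as J^T G J. The result is mapped back through psi.
    g_phi = J.T @ g
    G_phi = J.T @ G @ J
    d_phi = np.linalg.solve(G_phi, g_phi) if natural else g_phi
    theta_via_phi = psi(phi - eta * d_phi)

    return np.linalg.norm(theta_direct - theta_via_phi)

if __name__ == "__main__":
    for eta in (1e-1, 1e-2, 1e-3):
        print(eta,
              step_discrepancy(eta, natural=True),
              step_discrepancy(eta, natural=False))
```

Under these assumptions, reducing the learning rate tenfold should reduce the natural gradient discrepancy by roughly a factor of one hundred and the gradient descent discrepancy by roughly a factor of ten. For a linear reparameterization the natural gradient step would agree exactly at any learning rate; the nonlinear `psi` is chosen to show that the invariance holds only in the limit.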