The architecture of a deep neural network is defined explicitly in terms of the number of layers, the width of each layer and the general network topology. Existing optimisation frameworks neglect this information in favour of implicit architectural information (e.g. second-order methods) or architecture-agnostic distance functions (e.g. mirror descent). Meanwhile, the most popular optimiser in practice, Adam, is based on heuristics. This paper builds a new framework for deriving optimisation algorithms that explicitly leverage neural architecture. The theory extends mirror descent to non-convex composite objective functions: the idea is to transform a Bregman divergence to account for the non-linear structure of neural architecture. Working through the details for deep fully-connected networks yields automatic gradient descent: a first-order optimiser without any hyperparameters. Automatic gradient descent trains both fully-connected and convolutional networks out of the box and at ImageNet scale. A PyTorch implementation is available at https://github.com/jxbz/agd and also in Appendix B. Overall, the paper supplies a rigorous theoretical foundation for a next generation of architecture-dependent optimisers that work automatically and without hyperparameters.
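For readers unfamiliar with the starting point that the abstract mentions, the following is a minimal sketch of classical mirror descent and the Bregman divergence it is built on. It is not automatic gradient descent and not the paper's transformed divergence; the authors' implementation is the one in the repository linked above. The function names (`bregman_divergence`, `mirror_descent_step`, and the two example potentials) are hypothetical and chosen only for illustration, and the negative-entropy potential on the probability simplex is an assumed textbook example.

```python
import math


def bregman_divergence(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad_phi(y), x - y>."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner


# Potential 1: half squared Euclidean norm. Its Bregman divergence is
# half the squared Euclidean distance, which recovers gradient descent.
def half_sq_norm(x):
    return 0.5 * sum(xi * xi for xi in x)


def grad_half_sq_norm(x):
    return list(x)


# Potential 2: negative entropy. On the probability simplex its Bregman
# divergence is the KL divergence.
def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)


def grad_neg_entropy(x):
    return [math.log(xi) + 1.0 for xi in x]


def mirror_descent_step(x, grad, step_size):
    """One mirror descent step on the simplex under negative entropy.

    Minimising <grad, z> + D_phi(z, x) / step_size over the simplex has
    the closed form below (the exponentiated gradient update).
    """
    unnormalised = [xi * math.exp(-step_size * gi) for xi, gi in zip(x, grad)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def minimise_linear_loss(cost, steps=200, step_size=0.5):
    """Minimise <cost, x> over the simplex, starting from uniform weights."""
    x = [1.0 / len(cost)] * len(cost)
    for _ in range(steps):
        # The gradient of a linear loss is the cost vector itself.
        x = mirror_descent_step(x, cost, step_size)
    return x
```

Swapping the potential changes the geometry of the update without changing the loss; the abstract's point is that such a choice can be made to reflect the network architecture rather than being architecture-agnostic.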