Training a high-quality deep neural network requires choosing suitable hyperparameters, which is a non-trivial and expensive process. Existing work tries to optimize hyperparameters automatically, or to derive design principles for them, so that they generalize to diverse unseen scenarios. However, most of these design or optimization methods are agnostic to the choice of network structure, and thus largely ignore the impact of neural architectures on hyperparameters. In this work, we precisely characterize how initializations and maximal learning rates depend on the network architecture, including the network depth, width, convolutional kernel size, and connectivity patterns. By requiring every parameter to be maximally updated with the same mean squared change in pre-activations, we generalize our initialization and learning rates across MLPs (multi-layer perceptrons) and CNNs (convolutional neural networks) with sophisticated graph topologies. We verify our principles with comprehensive experiments. More importantly, our strategy sheds light on advancing current benchmarks for architecture design. A fair comparison of AutoML algorithms requires accurate network rankings. However, we demonstrate that network rankings in these benchmarks can easily change when the networks are trained better with our architecture-aware learning rates and initialization.