Growing Tiny Networks: Spotting Expressivity Bottlenecks and Fixing Them Optimally

Machine learning tasks are generally formulated as optimization problems, where one searches for an optimal function within a certain functional space. In practice, parameterized functional spaces are considered, in order to be able to perform gradient descent. Typically, a neural network architecture is chosen and fixed, and its parameters (connection weights) are optimized, yielding an architecture-dependent result. This approach, however, confines the evolution of the function during training to what the chosen architecture can express, and prevents any optimization across architectures. Costly architectural hyper-parameter optimization is often performed to compensate for this. Instead, we propose to adapt the architecture on the fly during training. We show that information about desirable architectural changes, which arise from expressivity bottlenecks when attempting to follow the functional gradient, can be extracted from backpropagation. To do this, we propose a mathematical definition of expressivity bottlenecks, which enables us to detect, quantify and solve them while training by adding suitable neurons. Thus, while the standard approach requires large networks, in terms of number of neurons per layer, for expressivity and optimization reasons, we provide tools and properties to develop an architecture starting with a very small number of neurons. As a proof of concept, we show results on the CIFAR dataset, matching large neural network accuracy, with competitive training time, while removing the need for standard architectural hyper-parameter search.
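To make the notion of an expressivity bottleneck more concrete, the following is a minimal sketch under simplifying assumptions that are ours and not taken from the text above: a single linear layer, a least-squares fit, and made-up toy data. The function name `expressivity_bottleneck` and the growth step are hypothetical illustrations. The bottleneck is measured as the part of a desired pre-activation update (for instance, the negative gradient obtained by backpropagation) that no change of the layer's weights can produce.

```python
import numpy as np


def expressivity_bottleneck(layer_inputs, desired_update):
    """Hypothetical helper: how much of a desired update a linear layer cannot follow.

    layer_inputs:   (n_samples, n_in) activations fed to the layer.
    desired_update: (n_samples, n_out) desired change of the pre-activations.
    Returns the best weight change (n_in, n_out) and the norm of what is left over.
    """
    # Best achievable weight change in the least-squares sense:
    # minimize ||desired_update - layer_inputs @ delta_w|| over delta_w.
    delta_w, *_ = np.linalg.lstsq(layer_inputs, desired_update, rcond=None)
    # The residual is the part of the desired update that no weight change can realize.
    residual = desired_update - layer_inputs @ delta_w
    return delta_w, float(np.linalg.norm(residual))


# Toy data (made up for illustration): 4 samples, a layer fed by a single neuron.
layer_inputs = np.array([[1.0], [2.0], [3.0], [4.0]])
desired_update = np.array([[1.0], [-1.0], [1.0], [-1.0]])

_, bottleneck_before = expressivity_bottleneck(layer_inputs, desired_update)

# Hypothetical growth step: add one neuron whose activation carries the missing direction.
new_neuron = np.array([[1.0], [-1.0], [1.0], [-1.0]])
grown_inputs = np.hstack([layer_inputs, new_neuron])
_, bottleneck_after = expressivity_bottleneck(grown_inputs, desired_update)

print(bottleneck_before, bottleneck_after)
```

In this toy setting the residual is clearly non-zero before the growth step and vanishes, up to numerical precision, once the extra neuron is added. How suitable neurons are actually constructed is not specified in the abstract, so the hand-picked `new_neuron` here is only a stand-in.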
