Make Deep Networks Shallow Again

Deep neural networks have a strong track record and are therefore widely viewed as the best architecture choice for complex applications. For a long time, their main shortcoming was the vanishing gradient, which prevented numerical optimization algorithms from converging acceptably. A breakthrough came with residual connections: an identity mapping placed in parallel to a conventional layer. This concept applies to stacks of layers of the same dimension and substantially alleviates the vanishing gradient problem. A stack of residual connection layers can be expressed as an expansion of terms similar to a Taylor expansion. This expansion suggests truncating the higher-order terms to obtain an architecture consisting of a single broad layer in which all of the initially stacked layers are arranged in parallel. In other words, a sequential deep architecture is replaced by a parallel shallow one. Prompted by this theory, we investigated the performance of the parallel architecture in comparison with the sequential one. Both architectures were trained on the computer vision datasets MNIST and CIFAR10 for a total of 6912 combinations of varying numbers of convolutional layers, numbers of filters, kernel sizes, and other meta-parameters. Our findings show a surprising equivalence between the deep (sequential) and shallow (parallel) architectures: both layouts produced similar training and validation set losses. This implies that a wide, shallow architecture can potentially replace a deep network without sacrificing performance. Such a substitution has the potential to simplify network architectures, improve optimization efficiency, and accelerate training.
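The expansion argument in the abstract can be illustrated with a small numerical sketch. It assumes purely linear residual layers of the form x + W_k x, which is a simplification chosen here for illustration (the study itself uses convolutional layers), and all function names are hypothetical. For two such layers, (I + W_2)(I + W_1)x = x + W_1 x + W_2 x + W_2 W_1 x, so dropping the second-order term W_2 W_1 x leaves exactly the parallel form x + W_1 x + W_2 x.

```python
import random

def matvec(W, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def add(a, b):
    """Element-wise sum of two vectors."""
    return [ai + bi for ai, bi in zip(a, b)]

def sequential(weights, x):
    """Deep stack: apply x <- x + W_k x one layer after another."""
    for W in weights:
        x = add(x, matvec(W, x))
    return x

def parallel(weights, x):
    """Shallow, wide layer: x + sum_k W_k x, i.e. first-order terms only."""
    out = list(x)
    for W in weights:
        out = add(out, matvec(W, x))
    return out

def random_weights(n_layers, dim, scale, rng):
    """Hypothetical helper: dim x dim matrices with entries in [-scale, scale]."""
    return [
        [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(dim)]
        for _ in range(n_layers)
    ]

def max_gap(scale, n_layers=3, dim=4, seed=0):
    """Largest absolute difference between the deep and shallow outputs."""
    rng = random.Random(seed)
    weights = random_weights(n_layers, dim, scale, rng)
    x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    seq = sequential(weights, x)
    par = parallel(weights, x)
    return max(abs(s - p) for s, p in zip(seq, par))

if __name__ == "__main__":
    for scale in (0.1, 0.01):
        print(scale, max_gap(scale))
```

In this linear setting the gap between the two outputs consists only of the truncated higher-order products, so it shrinks roughly quadratically as the weights get smaller. Whether the same holds for trained nonlinear convolutional networks is the empirical question the study addresses, and this sketch does not answer it.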

