On residual network depth

Deep residual architectures, such as ResNet and the Transformer, have enabled models of unprecedented depth, yet a formal understanding of why depth is so effective remains an open question. A popular intuition, following Veit et al. (2016), is that these residual networks behave like ensembles of many shallower models. Our key finding is an explicit analytical formula that verifies this ensemble perspective, proving that increasing network depth is mathematically equivalent to expanding the size of this implicit ensemble. Furthermore, our expansion reveals a hierarchical ensemble structure in which the combinatorial growth of computation paths leads to an explosion in the output signal. This insight offers a first-principles explanation for the historical dependence on normalization layers in training deep models and sheds new light on a family of successful normalization-free techniques such as SkipInit and Fixup. Whereas these previous approaches infer scaling factors through optimizer analysis or a heuristic analogy to Batch Normalization, our work offers the first explanation derived directly from the network's inherent functional structure. Specifically, our Residual Expansion Theorem reveals that scaling each residual module provides a principled solution to taming the combinatorial explosion inherent to these architectures. We further show that this scaling acts as a capacity control that also implicitly regularizes the model's complexity.
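The ensemble view can be made concrete in a simplified setting. The sketch below is not the paper's Residual Expansion Theorem, whose statement is not reproduced here. It assumes purely linear residual blocks of the form x <- x + scale * W x, and all names in it (`residual_forward`, `path_expansion`, `scale`) are hypothetical. Under that assumption, a depth-L network expands exactly into a sum over all 2^L subsets of blocks, and the total weight carried by those paths is (1 + scale)^L, which grows exponentially with depth unless the branches are scaled down. The choice scale = 1/L is used only as an illustration.

```python
import itertools
import math

import numpy as np


def residual_forward(x, weights, scale=1.0):
    """Apply linear residual blocks x <- x + scale * (W @ x), in order."""
    for W in weights:
        x = x + scale * (W @ x)
    return x


def path_expansion(x, weights, scale=1.0):
    """Sum one term per subset of blocks (one term per computation path)."""
    depth = len(weights)
    total = np.zeros_like(x)
    for k in range(depth + 1):
        for subset in itertools.combinations(range(depth), k):
            term = x.copy()
            # Indices ascend, so blocks are applied in input-to-output order.
            for i in subset:
                term = scale * (weights[i] @ term)
            total += term
    return total


def total_path_weight(depth, scale):
    """Sum over path lengths k of C(depth, k) * scale**k."""
    return sum(math.comb(depth, k) * scale**k for k in range(depth + 1))


# Illustrative assumption: small random linear blocks, not a trained model.
rng = np.random.default_rng(0)
depth, dim = 6, 4
weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(depth)]
x = rng.standard_normal(dim)

direct = residual_forward(x, weights)
expanded = path_expansion(x, weights)
num_paths = sum(math.comb(depth, k) for k in range(depth + 1))

unscaled_weight = total_path_weight(depth, 1.0)         # equals 2**depth
scaled_weight = total_path_weight(depth, 1.0 / depth)   # stays below e

print(np.allclose(direct, expanded), num_paths, unscaled_weight, scaled_weight)
```

In this linear toy case, adding one more block doubles the number of paths, which is the sense in which depth enlarges the implicit ensemble. Nonlinear blocks do not expand this cleanly, so the sketch should be read as intuition only.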
