We perform an empirical study of the behaviour of deep networks when some of their feature channels are fully linearized through a sparsity prior on the overall number of nonlinear units in the network. In experiments on image classification and machine translation tasks, we investigate how far the network function can be simplified towards linearity before performance collapses. First, we observe a significant performance gap when nonlinearity in the network function is reduced early rather than late in training, in line with recent observations on the time-evolution of the data-dependent NTK. Second, we find that after training, we are able to linearize a significant number of nonlinear units while maintaining high performance, indicating that much of a network's expressivity remains unused but helps gradient descent in the early stages of training. To characterize the depth of the resulting partially linearized network, we introduce a measure called average path length, representing the average number of active nonlinearities encountered along a path in the network graph. Under sparsity pressure, we find that the remaining nonlinear units organize into distinct structures, forming core-networks of near-constant effective depth and width, which in turn depend on task difficulty.
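To make the notion of average path length concrete, the following is a minimal sketch under simplifying assumptions that the text above does not state: a plain stack of fully connected layers with no skip connections, where every input-to-output path passes through exactly one unit per layer and all paths are weighted equally. Under those assumptions, the average number of nonlinear units on a path equals the sum of the per-layer fractions of units that are still nonlinear. The function names and the mask representation are hypothetical and chosen only for illustration.

```python
from itertools import product


def average_path_length(masks):
    """Average number of nonlinear units met along a path.

    masks: one list of booleans per layer; True means the unit is still
    nonlinear, False means it has been linearized.
    Assumption (not from the source): fully connected layers, no skip
    connections, all paths weighted equally.
    """
    return sum(sum(layer) / len(layer) for layer in masks)


def average_path_length_bruteforce(masks):
    """Same quantity by enumerating every path (one unit per layer)."""
    paths = list(product(*masks))
    return sum(sum(path) for path in paths) / len(paths)


# Toy network: three layers with 4, 4 and 2 units.
masks = [
    [True, True, False, False],    # half of the units are nonlinear
    [True, False, False, False],   # a quarter of the units are nonlinear
    [False, False],                # fully linearized layer
]

apl = average_path_length(masks)
apl_check = average_path_length_bruteforce(masks)
print(apl, apl_check)
```

In this toy case, a fully linearized layer contributes nothing to the measure, so the value can be far smaller than the nominal number of layers. Architectures with skip connections would need paths of different lengths to be handled, which this sketch does not attempt.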