Deep neural networks (DNNs) exhibit an exceptional capacity for generalization in practical applications. This work aims to capture the effect and benefits of depth for supervised learning via information-theoretic generalization bounds. We first derive two hierarchical bounds on the generalization error in terms of the Kullback-Leibler (KL) divergence or the 1-Wasserstein distance between the training and test distributions of the network's internal representations. The KL divergence bound shrinks as the layer index increases, while the Wasserstein bound implies the existence of a layer that serves as a generalization funnel, attaining a minimal 1-Wasserstein distance. We derive analytic expressions for both bounds in the setting of binary Gaussian classification with linear DNNs. To quantify the contraction of the relevant information measures as one moves deeper into the network, we analyze the strong data processing inequality (SDPI) coefficient between consecutive layers of three regularized DNN models: Dropout, DropConnect, and Gaussian noise injection. This enables us to refine our generalization bounds so that they capture the contraction as a function of the network architecture parameters. Specializing our results to DNNs with a finite parameter space and to the Gibbs algorithm reveals that deeper yet narrower network architectures generalize better in these examples, although how broadly this statement holds remains an open question.
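As a rough illustration of the layer-wise quantities discussed above, the following minimal sketch computes a coordinate-averaged 1-Wasserstein distance between training and test internal representations at every layer of a small linear network on synthetic binary Gaussian data, which is the kind of per-layer comparison a "generalization funnel" analysis would examine. This is not the paper's method: the network, data dimensions, and the use of a coordinate-averaged 1D Wasserstein distance as a surrogate for the multivariate distance are all illustrative assumptions.

```python
# Hypothetical sketch: per-layer 1-Wasserstein surrogate between train/test
# internal representations of a random linear DNN on binary Gaussian data.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

def sample_binary_gaussian(n, dim, mu=1.0):
    """Binary Gaussian classification data: y in {-1,+1}, x ~ N(y*mu*1, I)."""
    y = rng.choice([-1.0, 1.0], size=n)
    x = y[:, None] * mu + rng.standard_normal((n, dim))
    return x, y

def forward_layers(x, weights):
    """Return the internal representation at every layer of a linear DNN."""
    reps, h = [], x
    for W in weights:
        h = h @ W
        reps.append(h)
    return reps

dim, depth, width = 16, 5, 8
weights = [rng.standard_normal((dim if k == 0 else width, width)) / np.sqrt(width)
           for k in range(depth)]

x_train, _ = sample_binary_gaussian(500, dim)
x_test, _ = sample_binary_gaussian(500, dim)

train_reps = forward_layers(x_train, weights)
test_reps = forward_layers(x_test, weights)

# Crude surrogate for the per-layer 1-Wasserstein distance: average the 1D
# Wasserstein distance over coordinates of the representation.
for k, (tr, te) in enumerate(zip(train_reps, test_reps), start=1):
    w1 = np.mean([wasserstein_distance(tr[:, j], te[:, j]) for j in range(tr.shape[1])])
    print(f"layer {k}: coordinate-averaged W1 ~ {w1:.4f}")
```

In the paper's setting the representations would come from a trained network evaluated on its training sample versus fresh test data; here the weights are random and both samples are drawn i.i.d., so the printed values only demonstrate the computation, not the funnel phenomenon itself.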