Recently, information-theoretic analysis has become a popular framework for understanding the generalization behavior of deep neural networks. It allows a direct analysis of stochastic gradient/Langevin descent (SGD/SGLD) learning algorithms without strong assumptions such as Lipschitz or convexity conditions. However, the current generalization error bounds within this framework are still far from optimal, and substantial improvements to these bounds are challenging because high-dimensional information quantities are intractable. To address this issue, we first propose a novel information-theoretic measure, kernelized Rényi's entropy, built on operator representations in Hilbert space. It inherits the properties of Shannon's entropy and can be computed efficiently via simple random sampling, while remaining independent of the input dimension. We then establish generalization error bounds for SGD/SGLD under kernelized Rényi's entropy, where the mutual information quantities can be calculated directly, enabling evaluation of the tightness of each intermediate step. We show that our information-theoretic bounds depend on the statistics of the stochastic gradients evaluated along with the iterates, and are rigorously tighter than the current state-of-the-art (SOTA) results. The theoretical findings are also supported by large-scale empirical studies.
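To make the claim of sample-based, dimension-independent computation concrete, the sketch below estimates a Rényi-type entropy from the eigenvalues of a normalized Gram matrix. It is an illustration only: it uses the well-known matrix-based Rényi α-entropy functional, S_α(A) = log2(tr(A^α)) / (1 − α), which is not necessarily the exact definition of kernelized Rényi's entropy in this paper. The function names `gaussian_gram` and `matrix_renyi_entropy`, the Gaussian kernel, and the defaults `alpha=2.0` and `sigma=1.0` are all assumptions made for the example.

```python
import numpy as np


def gaussian_gram(x, sigma=1.0):
    """Gaussian-kernel Gram matrix for samples x of shape (n, d).

    The result has shape (n, n): its size depends on the number of
    samples n, not on the input dimension d.
    """
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative round-off
    return np.exp(-d2 / (2.0 * sigma ** 2))


def matrix_renyi_entropy(x, alpha=2.0, sigma=1.0):
    """Matrix-based Renyi alpha-entropy (in bits) of the samples x.

    Assumption: this is the generic matrix-based functional, used here
    as a stand-in for the paper's kernelized Renyi's entropy.
    """
    if alpha <= 0 or alpha == 1.0:
        raise ValueError("alpha must be positive and different from 1")
    k = gaussian_gram(x, sigma)
    a = k / np.trace(k)  # normalize to unit trace
    eigvals = np.linalg.eigvalsh(a)  # a is symmetric
    eigvals = np.clip(eigvals, 0.0, None)  # remove round-off negatives
    return float(np.log2(np.sum(eigvals ** alpha)) / (1.0 - alpha))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = rng.normal(size=(64, 1000))  # 64 samples in 1000 dimensions
    print(matrix_renyi_entropy(samples, alpha=2.0, sigma=30.0))
```

Two limiting cases give a quick sanity check on this sketch: identical samples yield an entropy of 0 bits, and n samples that are far apart relative to `sigma` yield log2(n) bits. The only matrix that is decomposed is n × n, so the cost of the eigenvalue step does not grow with the input dimension.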