Adaptive Estimators Show Information Compression in Deep Neural Networks

From arXiv. Accepted as a poster presentation at ICLR 2019 and reviewed on OpenReview (available at https://openreview.net/forum?id=SkeZisA5t7). Pages: 11. Figures: 9.

To improve how neural networks function, it is crucial to understand their learning process. The information bottleneck theory of deep learning proposes that neural networks achieve good generalization by compressing their representations to disregard information that is not relevant to the task. However, empirical evidence for this theory is conflicting, as compression was only observed when networks used saturating activation functions. In contrast, networks with non-saturating activation functions achieved comparable levels of task performance but did not show compression. In this paper we developed more robust mutual information estimation techniques that adapt to the hidden activity of neural networks and produce more sensitive measurements of activations from all functions, especially unbounded functions. Using these adaptive estimation techniques, we explored compression in networks with a range of different activation functions. With two improved estimation methods, we first show that saturation of the activation function is not required for compression, and that the amount of compression varies between different activation functions. We also find that there is a large amount of variation in compression between different network initializations. Second, we see that L2 regularization leads to significantly increased compression while preventing overfitting. Finally, we show that only compression of the last layer is positively correlated with generalization.
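The abstract does not spell out the estimators, so the following is only a minimal sketch of what "adapting to hidden activity" could look like, under stated assumptions: bin edges are placed at quantiles of the observed activations (an assumed scheme, not necessarily the paper's), and for a deterministic network with distinct inputs the information a layer carries about the input is approximated by the entropy of its discretized activation patterns. All function names here (`adaptive_bin_edges`, `discretize`, `entropy_bits`, `binned_information`) are hypothetical.

```python
import numpy as np


def adaptive_bin_edges(activations, n_bins):
    """Place bin edges at quantiles of the observed activations.

    Assumption for illustration: every bin should hold roughly the same
    number of values, so the edges follow the data instead of a fixed range.
    """
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)
    return np.quantile(activations.ravel(), quantiles)


def discretize(activations, edges):
    """Map each activation to a bin index in 0..n_bins-1 (interior edges only)."""
    return np.digitize(activations, edges[1:-1])


def entropy_bits(codes):
    """Entropy (in bits) of the rows of an integer array of shape (samples, units)."""
    _, counts = np.unique(codes, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def binned_information(activations, n_bins=30):
    """Entropy of the discretized layer activity, used here as a proxy for I(X; T)."""
    edges = adaptive_bin_edges(activations, n_bins)
    return entropy_bits(discretize(activations, edges))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Unbounded, ReLU-like activity for 500 samples and 3 hidden units.
    hidden = np.maximum(rng.normal(size=(500, 3)), 0.0)
    print(binned_information(hidden, n_bins=8))
    # Rescaling the activations leaves the estimate unchanged, because the
    # bin edges rescale with the data.
    print(binned_information(4.0 * hidden, n_bins=8))
```

Because the edges follow the data, this kind of estimate does not depend on the scale of the activations, which is the property that matters for unbounded functions such as ReLU. A fixed-range binning would instead push most large activations into the outermost bins.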
