Reviewing Evolution of Learning Functions and Semantic Information Measures for Understanding Deep Learning

A new trend in deep learning, represented by Mutual Information Neural Estimation (MINE) and Information Noise Contrast Estimation (InfoNCE), is emerging. In this trend, similarity functions and Estimated Mutual Information (EMI) are used as learning and objective functions. Coincidentally, EMI is essentially the same as the Semantic Mutual Information (SeMI) proposed by the author 30 years ago. This paper first reviews the evolutionary histories of semantic information measures and learning functions. It then briefly introduces the author's semantic information G theory with the rate-fidelity function R(G) (G denotes SeMI, and R(G) extends the rate-distortion function R(D)) and its applications to multi-label learning, maximum Mutual Information (MI) classification, and mixture models. Next, it discusses how we should understand the relationship between SeMI and Shannon's MI, two generalized entropies (fuzzy entropy and coverage entropy), Autoencoders, Gibbs distributions, and partition functions from the perspective of the R(G) function or the G theory. An important conclusion is that mixture models and Restricted Boltzmann Machines converge because SeMI is maximized and Shannon's MI is minimized, making the information efficiency G/R close to 1. A potential opportunity is to simplify deep learning by using Gaussian channel mixture models to pre-train the latent layers of deep neural networks without considering gradients. The paper also discusses how the SeMI measure can be used as the reward function (reflecting purposiveness) for reinforcement learning. The G theory helps interpret deep learning, but it is far from sufficient on its own. Combining semantic information theory and deep learning will accelerate the development of both.
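To make the information efficiency G/R concrete, the following is a minimal numerical sketch, not the author's implementation. It assumes that SeMI takes the form G = sum_ij P(x_i, y_j) log[T(theta_j|x_i) / T(theta_j)], where T(theta_j|x_i) is a truth (similarity) function and T(theta_j) = sum_i P(x_i) T(theta_j|x_i); the abstract itself does not state this formula. The function names, the toy joint distribution, and both truth functions are hypothetical choices made only for illustration.

```python
import numpy as np

def shannon_mi(p_xy):
    """Shannon mutual information R = I(X;Y) in bits from a joint table P(x_i, y_j)."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # column vector P(x_i)
    p_y = p_xy.sum(axis=0, keepdims=True)   # row vector P(y_j)
    mask = p_xy > 0                          # skip zero-probability cells
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

def semantic_mi(p_xy, truth):
    """Assumed SeMI form: G = sum_ij P(x_i,y_j) * log2(T(theta_j|x_i) / T(theta_j))."""
    p_x = p_xy.sum(axis=1)                   # P(x_i)
    logical = p_x @ truth                    # T(theta_j) = sum_i P(x_i) T(theta_j|x_i)
    ratio = truth / logical                  # broadcast over rows
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(ratio[mask])))

# Toy joint distribution over two instances x and two labels y (hypothetical).
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

# Matched truth function: proportional to P(y_j|x_i), scaled so its maximum is 1.
p_y_given_x = p_xy / p_xy.sum(axis=1, keepdims=True)
truth_matched = p_y_given_x / p_y_given_x.max(axis=0, keepdims=True)

# Mismatched truth function: too permissive for the "wrong" instance.
truth_mismatched = np.array([[1.0, 0.5],
                             [0.5, 1.0]])

r = shannon_mi(p_xy)
g_matched = semantic_mi(p_xy, truth_matched)
g_mismatched = semantic_mi(p_xy, truth_mismatched)

print(f"R = {r:.4f} bits")
print(f"matched:    G = {g_matched:.4f}, G/R = {g_matched / r:.4f}")
print(f"mismatched: G = {g_mismatched:.4f}, G/R = {g_mismatched / r:.4f}")
```

Under this assumed form, R - G equals an average Kullback-Leibler divergence between P(x|y_j) and the distribution implied by the truth function, so G cannot exceed R, and G/R reaches 1 only when the truth function is proportional to P(y_j|x_i). This is one way to read the abstract's claim that convergence coincides with G/R approaching 1.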
