Connecting Jensen-Shannon and Kullback-Leibler Divergences: A New Bound for Representation Learning

Source: arXiv. Accepted at NeurIPS 2025. This revised version provides a proof of Lemma B.5, which was stated as a conjecture in the original submission. Code is available at https://github.com/ReubenDo/JSDlowerbound/

Mutual Information (MI) is a fundamental measure of statistical dependence widely used in representation learning. Because direct optimization of MI via its definition as a Kullback-Leibler divergence (KLD) is often intractable, many recent methods instead maximize alternative dependence measures, most notably the Jensen-Shannon divergence (JSD) between the joint distribution and the product of the marginals, via discriminative losses. However, the connection between these surrogate objectives and MI remains poorly understood. In this work, we bridge this gap by deriving a new, tight, and tractable lower bound on KLD as a function of JSD in the general case. By specializing this bound to the joint distribution and the product of the marginals, we demonstrate that maximizing the JSD-based information increases a guaranteed lower bound on mutual information. Furthermore, we revisit the practical implementation of JSD-based objectives and observe that minimizing the cross-entropy loss of a binary classifier trained to distinguish joint from marginal pairs recovers a known variational lower bound on the JSD. Extensive experiments demonstrate that our lower bound is tight when applied to MI estimation. We compare our lower bound to state-of-the-art neural estimators based on variational lower bounds across a range of established reference scenarios. Our lower bound estimator consistently provides a stable, low-variance estimate of a tight lower bound on MI. We also demonstrate its practical usefulness in the context of the Information Bottleneck framework. Taken together, our results provide new theoretical justification and strong empirical evidence for using discriminative learning in MI-based representation learning.
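The classifier observation above can be checked numerically on a small discrete example. The following is a minimal sketch, not the authors' implementation: it assumes a toy 3x4 joint table, natural logarithms, and equal class priors, and all function and variable names are hypothetical. It illustrates only the standard variational bound JSD >= log 2 - CE, where CE is the balanced binary cross-entropy of a classifier separating joint samples from product-of-marginals samples. It does not reproduce the paper's new KLD-JSD bound, whose closed form is not given in the abstract.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats for discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """Jensen-Shannon divergence in nats (upper-bounded by log 2)."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def balanced_cross_entropy(d, p, q):
    """Expected binary cross-entropy of a classifier with equal class priors.

    d[x, y] is the predicted probability that the pair (x, y) was drawn
    from p (the joint) rather than from q (the product of marginals).
    """
    return float(-0.5 * np.sum(p * np.log(d)) - 0.5 * np.sum(q * np.log(1.0 - d)))

def jsd_lower_bound(d, p, q):
    """Variational lower bound on JSD(p, q): log 2 minus the cross-entropy."""
    return float(np.log(2.0)) - balanced_cross_entropy(d, p, q)

# Toy joint distribution over a 3x4 grid (an assumption for illustration).
rng = np.random.default_rng(0)
joint = rng.dirichlet(np.ones(12)).reshape(3, 4)
product = np.outer(joint.sum(axis=1), joint.sum(axis=0))

mi = kl(joint, product)          # mutual information I(X; Y)
true_jsd = jsd(joint, product)   # JSD-based dependence measure

# The Bayes-optimal classifier attains the bound with equality;
# an arbitrary classifier gives a looser (smaller) value.
d_opt = joint / (joint + product)
d_rand = rng.uniform(0.05, 0.95, size=joint.shape)
bound_opt = jsd_lower_bound(d_opt, joint, product)
bound_rand = jsd_lower_bound(d_rand, joint, product)

print(f"MI = {mi:.4f}, JSD = {true_jsd:.4f}")
print(f"bound (optimal classifier) = {bound_opt:.4f}")
print(f"bound (random classifier)  = {bound_rand:.4f}")
```

In practice the expectations would be replaced by minibatch averages over sampled pairs, so lowering the classifier's cross-entropy tightens the estimate of the JSD from below.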
