Why Self-Supervised Encoders Want to Be Normal

We develop a geometric and information-theoretic framework for encoder-decoder learning built on the Information Bottleneck (IB) principle. Recasting IB as a rate-distortion problem with Kullback-Leibler (KL) divergence as distortion, we show that the optimal representation at any distortion level is a soft clustering of the \emph{predictive manifold} $\mathcal{M}=\{p(Y|x):x\in\mathcal{X}\}$ inside the probability simplex, admitting a linear decoder in the canonical parameterization. We derive a chain of exact transformations, from flat Dirichlet to exponential to isotropic Gaussian, connecting the maximum entropy prior on the simplex to Euclidean space, with quantified entropy overhead at each step, and show that Sketched Isotropic Gaussian Regularization (SIGReg) implements a Gaussian relaxation of this principle whose overhead affects rate accounting but not achievable prediction. This relaxation provides a principled distributional regularizer for learning with limited or no supervision. Using the Conditional Entropy Bottleneck (CEB) decomposition, we derive concrete encoder losses for supervised and semi-supervised settings, estimated via minibatch marginals without variational bounds. In the self-supervised setting, the CEB conditional rate is replaced by a view-prediction proxy. SIGReg serves as the distributional regularizer for both the semi-supervised and self-supervised settings. Experiments on toy problems and FashionMNIST confirm the predicted rate-distortion trade-offs and show that the non-parametric estimator is competitive with the standard variational approach.
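The Dirichlet-exponential-Gaussian chain can be illustrated with two standard identities, run here in the reverse direction (Gaussian to simplex). Half the squared norm of a two-dimensional standard Gaussian is an Exp(1) variable, and normalizing $k$ i.i.d. Exp(1) variables gives a draw from the flat Dirichlet on the simplex. The sketch below is only an assumption about how such a chain may look: the function name `flat_dirichlet_from_gaussians` is hypothetical, and the maps and entropy accounting used in the paper may differ.

```python
import random


def flat_dirichlet_from_gaussians(k, rng):
    """Draw one point on the (k-1)-simplex from 2k standard Gaussian draws.

    Illustrative helper, not taken from the paper.
    """
    # Gaussian -> exponential: (g1^2 + g2^2) / 2 is Exp(1) for independent
    # standard normals g1, g2 (a chi-squared variable with 2 dof, halved).
    exps = [
        0.5 * (rng.gauss(0.0, 1.0) ** 2 + rng.gauss(0.0, 1.0) ** 2)
        for _ in range(k)
    ]
    # Exponential -> flat Dirichlet: normalising k i.i.d. Exp(1) variables
    # yields a Dirichlet(1, ..., 1) sample.
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)  # fixed seed so the check is reproducible
k = 4
samples = [flat_dirichlet_from_gaussians(k, rng) for _ in range(20000)]

# Empirical first moments, and the second moment of the first coordinate.
means = [sum(s[i] for s in samples) / len(samples) for i in range(k)]
second_moment = sum(s[0] ** 2 for s in samples) / len(samples)
```

For a flat Dirichlet on $k$ outcomes, each coordinate has mean $1/k$ and second moment $2/(k(k+1))$, so `means` and `second_moment` give a quick sanity check on the sampler.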
