Neural networks trained with standard objectives exhibit behaviors characteristic of probabilistic inference: soft clustering, prototype specialization, and Bayesian uncertainty tracking. These phenomena appear across architectures -- in attention mechanisms, classification heads, and energy-based models -- yet existing explanations rely on loose analogies to mixture models or post-hoc architectural interpretation. We provide a direct derivation. For any objective with log-sum-exp structure over distances or energies, the gradient with respect to each distance is exactly the negative posterior responsibility of the corresponding component: $\partial L / \partial d_j = -r_j$. This is an algebraic identity, not an approximation. The immediate consequence is that gradient descent on such objectives performs expectation-maximization implicitly -- responsibilities are not auxiliary variables to be computed but gradients to be applied. No explicit inference algorithm is required because inference is embedded in optimization. This result unifies three regimes of learning under a single mechanism: unsupervised mixture modeling, where responsibilities are fully latent; attention, where responsibilities are conditioned on queries; and cross-entropy classification, where supervision clamps responsibilities to targets. The Bayesian structure recently observed in trained transformers is not an emergent property but a necessary consequence of the objective geometry. Optimization and inference are the same process.
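To make the stated identity concrete, here is a minimal derivation sketch under the simplest assumption, namely that the objective is the unweighted log-sum-exp of negated distances, $L = \log \sum_k \exp(-d_k)$; with mixture weights $\pi_k$ the same computation goes through with $\pi_k \exp(-d_k)$ in place of $\exp(-d_k)$ in both sums:

$$
\frac{\partial L}{\partial d_j}
= \frac{\partial}{\partial d_j} \log \sum_{k} \exp(-d_k)
= -\,\frac{\exp(-d_j)}{\sum_{k} \exp(-d_k)}
= -\,r_j .
$$

The fraction is the softmax of the negated distances, which is the posterior responsibility that the E-step of EM would assign to component $j$; applying this gradient is therefore a responsibility-weighted update, which is the sense in which optimization carries out the inference step.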