It was recently observed that the representations produced by different models processing identical or semantically related inputs tend to align. We analyze this phenomenon using the Information Imbalance, an asymmetric rank-based measure that quantifies the capability of one representation to predict another, providing a proxy for the cross-entropy that can be computed efficiently in high-dimensional spaces. By measuring the Information Imbalance between representations generated by DeepSeek-V3 processing translations, we find that semantic information is spread across many tokens, and that semantic predictability is strongest in a set of central layers of the network, a pattern robust across six language pairs. We measure clear information asymmetries: English representations are systematically more predictive than those of other languages, and DeepSeek-V3 representations are more predictive of those of a smaller model such as Llama3-8b than the reverse. In the visual domain, we observe that semantic information concentrates in middle layers for autoregressive models and in final layers for encoder models, and these same layers yield the strongest cross-modal predictability with textual representations of image captions. Notably, two independently trained models (DeepSeek-V3 and DinoV2) achieve stronger cross-modal predictability than the jointly trained CLIP model, suggesting that model scale may outweigh explicit multimodal training. Our results support the hypothesis of semantic convergence across languages, modalities, and architectures, while showing that directed predictability between representations varies strongly with layer depth, model scale, and language.
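To make the measure concrete, the following is a minimal sketch of the rank-based Information Imbalance Δ(A→B) described above, assuming Euclidean distances between representation vectors; the function name and the plain NumPy implementation are illustrative, not the authors' code. For each point, the nearest neighbor in space A is found, and its distance rank in space B is averaged: values near 0 mean A predicts B well, values near 1 mean no predictive power.

```python
import numpy as np

def information_imbalance(X_a: np.ndarray, X_b: np.ndarray) -> float:
    """Delta(A -> B): how well neighborhoods in space A predict those in B.

    X_a, X_b: (n_points, dim) arrays of paired representations.
    Returns ~0 if A fully predicts B, ~1 if A carries no information about B.
    """
    n = X_a.shape[0]
    # Pairwise Euclidean distance matrices in both spaces.
    d_a = np.linalg.norm(X_a[:, None, :] - X_a[None, :, :], axis=-1)
    d_b = np.linalg.norm(X_b[:, None, :] - X_b[None, :, :], axis=-1)
    # Exclude self-distances from the neighbor search.
    np.fill_diagonal(d_a, np.inf)
    np.fill_diagonal(d_b, np.inf)
    # Nearest neighbor of each point in space A.
    nn_a = d_a.argmin(axis=1)
    # 1-based distance ranks of every point in space B.
    ranks_b = d_b.argsort(axis=1).argsort(axis=1) + 1
    # Rank in B of the A-nearest neighbor, averaged and normalized.
    r = ranks_b[np.arange(n), nn_a]
    return float(2.0 / n * r.mean())
```

The asymmetry noted in the abstract follows directly: Δ(A→B) and Δ(B→A) are computed from different nearest-neighbor sets, so one representation can predict another much better than the reverse.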