The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers, but typically relies on contrastive losses and millions of paired samples. In this work, we ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image-text pairs together with large amounts of unpaired data. To address this challenge, we propose SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, then refines the alignment on unpaired samples via an optimal-transport-based divergence that transfers relational structure without overconstraining the target space. Unlike existing semi-supervised methods, SOTAlign effectively leverages unpaired images and text, learning robust joint embeddings across datasets and encoder pairs, and significantly outperforming supervised and semi-supervised baselines.
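The two stages described above can be sketched in a minimal, hypothetical form. This is not the paper's implementation: the exact form of SOTAlign's linear teacher and its optimal-transport divergence are not specified here, so the sketch assumes a ridge-regression teacher for stage one and an entropic (Sinkhorn) transport cost between mapped unpaired image embeddings and unpaired text embeddings for stage two. All function names and hyperparameters are illustrative.

```python
import numpy as np

def fit_linear_teacher(img_paired, txt_paired, reg=1e-3):
    """Stage 1 (assumed form): ridge regression mapping image embeddings
    to text embeddings using the small paired set.
    Solves W = (X^T X + reg*I)^{-1} X^T Y."""
    d = img_paired.shape[1]
    return np.linalg.solve(
        img_paired.T @ img_paired + reg * np.eye(d),
        img_paired.T @ txt_paired,
    )

def sinkhorn_plan(C, eps=0.1, n_iter=200):
    """Entropic OT plan for a cost matrix C with uniform marginals,
    via standard Sinkhorn iterations."""
    n, m = C.shape
    K = np.exp(-C / eps)
    a, b = np.ones(n) / n, np.ones(m) / m
    v = np.ones(m) / m
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def ot_refinement_cost(img_unpaired, txt_unpaired, W):
    """Stage 2 (assumed form): map unpaired images through the teacher W,
    then compute the entropic OT cost under cosine distance.
    A training loop would minimize this over a refinement of W."""
    Z = img_unpaired @ W
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Tn = txt_unpaired / np.linalg.norm(txt_unpaired, axis=1, keepdims=True)
    C = 1.0 - Zn @ Tn.T            # cosine distance in [0, 2]
    P = sinkhorn_plan(C)
    return float((P * C).sum())
```

In this sketch the OT cost plays the role of the relational divergence: because the transport plan matches distributions rather than individual pairs, it constrains the mapped image cloud to occupy the same region of the text space without dictating a point-to-point correspondence.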