As models and data scale, independently trained networks often induce analogous notions of similarity. But matching similarities is weaker than establishing an explicit correspondence between the representation spaces, especially for multimodal models, where consistency must hold not only within each modality but also for the learned image-text coupling. We therefore ask: given two independently trained multimodal contrastive models with encoders $(f, g)$ and $(\widetilde{f},\widetilde{g})$ -- trained on different distributions and with different architectures -- does a systematic geometric relationship exist between their embedding spaces? If so, what form does it take, and does it hold uniformly across modalities? In this work, we show that across model families such as CLIP, SigLIP, and FLAVA, this geometric relationship is well approximated by an orthogonal map (up to a global mean shift): there exists an orthogonal matrix $Q$ with $Q^\top Q = I$ such that $\widetilde{f}(x)\approx Q f(x)$ for paired images $x$. Strikingly, the same $Q$ simultaneously aligns the text encoders, i.e., $\widetilde{g}(y)\approx Q g(y)$ for texts $y$. Theoretically, we prove that if the multimodal kernel agrees across models on a small anchor set, i.e., $\langle f(x), g(y)\rangle \approx \langle \widetilde{f}(x), \widetilde{g}(y)\rangle$, then the two models must be related by a single orthogonal map $Q$, and the same $Q$ maps both images and texts across models. More broadly, this finding enables backward-compatible model upgrades, avoiding costly re-embedding, and has implications for the privacy of learned representations. Our project page: https://canonical-multimodal.github.io/
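The orthogonal relationship described above can be sketched numerically: given paired embeddings from two models, the best orthogonal map (up to a global mean shift) is the classical orthogonal Procrustes solution, obtained from an SVD of the cross-covariance of the centered embeddings. This is a minimal illustration, not the paper's exact estimation procedure; the function names and the centering convention are assumptions.

```python
import numpy as np

def fit_orthogonal_map(F, F_tilde):
    """Estimate Q (Q^T Q = I) minimizing ||(F_tilde - mu_t) - (F - mu) Q^T||_F.

    F, F_tilde: (n, d) arrays of paired embeddings, one row per input,
    from models (f, g) and (f~, g~) respectively.
    """
    # Center both sets: the claimed relationship holds up to a global mean shift.
    mu, mu_t = F.mean(axis=0), F_tilde.mean(axis=0)
    Fc, Ftc = F - mu, F_tilde - mu_t
    # Orthogonal Procrustes: SVD of the cross-covariance M = Ftc^T Fc,
    # then Q = U V^T maximizes tr(Q^T M) over orthogonal Q.
    U, _, Vt = np.linalg.svd(Ftc.T @ Fc)
    Q = U @ Vt
    return Q, mu, mu_t

def apply_map(Q, mu, mu_t, Z):
    """Map embeddings Z from the first model's space into the second's."""
    return (Z - mu) @ Q.T + mu_t
```

A single `Q` fitted on image pairs would, per the paper's claim, also transport text embeddings between the two models; the same `apply_map` call can be reused on the text encoder's outputs.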