Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings from different modalities that express identical semantics occupy systematically offset regions of the shared space. Prior approaches to bridging this gap largely rely on oversimplified isotropic assumptions, which hinders their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this modeling, we introduce ReAlign, a training-free modality alignment strategy. Using statistics from massive unpaired data, ReAlign maps text representations into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying the geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.
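To make the idea of aligning text representations to the image distribution with unpaired statistics concrete, the following is a minimal sketch and not the ReAlign procedure itself. The abstract names the three steps (Anchor, Trace, Centroid Alignment) but does not define them, so the mapping used here is an assumption: "Anchor" removes the text mean, "Trace" rescales the residuals so that the total variance (the trace of the covariance) matches that of the image pool, and "Centroid" translates the result to the image mean. The function name `realign_sketch` and its parameters are hypothetical.

```python
import numpy as np


def realign_sketch(text_emb, image_emb, eps=1e-8):
    """Hypothetical moment matching of text embeddings to image statistics.

    text_emb:  (n_text, d) array of text embeddings.
    image_emb: (n_image, d) array of image embeddings.
    The two pools are unpaired: rows do not need to correspond.
    This is an illustrative assumption, not the paper's exact method.
    """
    # Pool-level statistics only; no image-text pairing is used.
    mu_text = text_emb.mean(axis=0)
    mu_image = image_emb.mean(axis=0)

    # Step 1 (assumed "Anchor"): remove the text-side bias.
    residual = text_emb - mu_text

    # Step 2 (assumed "Trace"): match total variance, i.e. the trace of
    # the covariance, while keeping the residuals' anisotropic shape.
    trace_text = residual.var(axis=0).sum()
    trace_image = image_emb.var(axis=0).sum()
    scale = np.sqrt(trace_image / (trace_text + eps))
    residual = residual * scale

    # Step 3 (assumed "Centroid"): move the cloud to the image centroid.
    return residual + mu_image


# Usage with synthetic, unpaired pools of different sizes.
rng = np.random.default_rng(0)
text_pool = rng.normal(size=(200, 8)) * 0.5 + 1.0
image_pool = rng.normal(size=(300, 8)) * 2.0 - 1.0
aligned = realign_sketch(text_pool, image_pool)
```

Because only the first moment and a single scalar variance are matched, this sketch preserves the per-direction shape of the text residuals rather than forcing them to be isotropic. Whether ReAlign uses a richer, direction-dependent correction is not stated in the abstract.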