Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly known as the Modality Gap remains: embeddings from different modalities that express identical semantics occupy systematically offset regions of the representation space. Prior approaches to bridging this gap rely largely on oversimplified isotropic assumptions, which hinders their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Using statistics estimated from massive unpaired data, ReAlign maps text representations into the image representation distribution through a three-step process comprising Anchor, Trace, and Centroid Alignment, explicitly rectifying the geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without requiring large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path toward the efficient scaling of MLLMs.
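To make the idea of training-free statistical alignment concrete, the sketch below shifts and rescales text embeddings so that their first- and second-order statistics match those of image embeddings estimated from unpaired corpora. It is a minimal illustration only: the abstract does not specify the Anchor, Trace, and Centroid Alignment steps, so the particular centering, whitening/re-coloring, and re-anchoring operations (and the function name `realign_text_embeddings`) are assumptions made for illustration, not the actual ReAlign procedure.

```python
import numpy as np

def realign_text_embeddings(text_emb, img_stats, text_stats):
    """Illustrative, hypothetical statistical alignment of text embeddings
    onto an image-embedding distribution, using only means and covariances
    estimated from unpaired data. NOT the actual ReAlign algorithm."""
    mu_t, cov_t = text_stats   # statistics of unpaired text embeddings
    mu_i, cov_i = img_stats    # statistics of unpaired image embeddings

    # Remove the text centroid (assumed "centroid"-style step).
    x = text_emb - mu_t

    # Whiten along text principal axes, then re-color with image principal
    # axes so the anisotropic spread matches (assumed "trace"-style step).
    eval_t, evec_t = np.linalg.eigh(cov_t)
    eval_i, evec_i = np.linalg.eigh(cov_i)
    whiten = evec_t @ np.diag(1.0 / np.sqrt(eval_t + 1e-6)) @ evec_t.T
    color = evec_i @ np.diag(np.sqrt(eval_i + 1e-6)) @ evec_i.T
    x = x @ whiten @ color

    # Translate onto the image centroid (assumed "anchor"-style step).
    return x + mu_i


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 64
    # Synthetic unpaired embeddings standing in for frozen encoder outputs.
    img = rng.normal(loc=1.0, scale=0.5, size=(10_000, d))
    txt = rng.normal(loc=-1.0, scale=2.0, size=(10_000, d))

    img_stats = (img.mean(0), np.cov(img, rowvar=False))
    txt_stats = (txt.mean(0), np.cov(txt, rowvar=False))

    aligned = realign_text_embeddings(txt, img_stats, txt_stats)
    print("centroid gap before:", np.linalg.norm(txt.mean(0) - img.mean(0)))
    print("centroid gap after: ", np.linalg.norm(aligned.mean(0) - img.mean(0)))
```

The design point this sketch illustrates is that such an alignment needs only distribution-level statistics from unpaired text and image collections, not paired image-text supervision.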