Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer -- eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.
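To make the two quantities in this abstract concrete, the following is a minimal NumPy sketch of the standard symmetric image-text contrastive (InfoNCE) objective used by dual-encoder baselines, together with one common way to quantify the modality gap (the distance between the centroids of the normalized image and text embeddings). This is not ITO's implementation: the abstract does not specify the multiple-alignment loss or the fusion module, so neither is sketched here. The function names, the temperature value of 0.07, and the centroid-distance definition of the gap are assumptions chosen for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each embedding to unit length (eps guards against zero vectors)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(img, txt, temperature=0.07):
    """Baseline symmetric InfoNCE loss for a batch of paired embeddings.

    img, txt: arrays of shape (batch, dim); row i of img is paired with
    row i of txt. The temperature default is an illustrative assumption.
    """
    img = l2_normalize(img)
    txt = l2_normalize(txt)
    logits = img @ txt.T / temperature  # (batch, batch) cosine similarities
    idx = np.arange(len(img))
    # Image-to-text: each row should assign its mass to the diagonal entry.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: same target, normalized over columns instead.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


def modality_gap(img, txt):
    """Assumed gap measure: distance between normalized modality centroids."""
    delta = l2_normalize(img).mean(axis=0) - l2_normalize(txt).mean(axis=0)
    return float(np.linalg.norm(delta))
```

Under these assumptions, a batch whose image and text embeddings coincide has zero gap and a low loss, while shifting one modality by a constant offset leaves a nonzero gap even when pairs remain correctly ranked. That is the sense in which a representation can be discriminative yet still partially organized by modality.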