Multimodal models have demonstrated powerful capabilities in complex tasks requiring multimodal alignment, including zero-shot classification and cross-modal retrieval. However, existing models typically rely on millions of paired multimodal samples, which are prohibitively expensive or infeasible to obtain in many domains. In this work, we explore the feasibility of building multimodal models with a limited amount of paired data by aligning pretrained unimodal foundation models. We show that high-quality alignment is possible with as few as tens of thousands of paired samples, less than $1\%$ of the data typically used in the field. To achieve this, we introduce STRUCTURE, an effective regularization technique that preserves the neighborhood geometry of the latent spaces of the unimodal encoders. Additionally, we show that aligning only the last layers is often suboptimal, and we demonstrate the benefits of instead aligning the layers with the highest representational similarity across modalities. These two components can be readily incorporated into existing alignment methods, yielding substantial gains across 24 zero-shot image classification and retrieval benchmarks, with average relative improvements of $51.6\%$ in classification and $91.8\%$ in retrieval. Our results highlight the effectiveness and broad applicability of our framework for limited-sample multimodal learning and offer a promising path forward for resource-constrained domains.
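To make the two components concrete, below is a minimal sketch, assuming STRUCTURE penalizes distortion of the batch-wise pairwise cosine-similarity structure between the frozen unimodal embeddings and their aligned counterparts, and assuming representational similarity across layers is measured with linear CKA. The function names, the CKA choice, and the loss form are illustrative assumptions, not the paper's confirmed formulation.

```python
import torch
import torch.nn.functional as F

def structure_regularizer(z_pretrained: torch.Tensor,
                          z_aligned: torch.Tensor) -> torch.Tensor:
    """Hypothetical neighborhood-geometry-preserving penalty.

    z_pretrained: frozen unimodal embeddings for a batch, shape (B, D1).
    z_aligned:    the same batch after the alignment head, shape (B, D2).
    Returns a scalar that grows when the batch's pairwise similarity
    structure changes under alignment.
    """
    # Pairwise cosine-similarity matrices within the batch.
    p = F.normalize(z_pretrained, dim=-1)
    a = F.normalize(z_aligned, dim=-1)
    s_pre = p @ p.T    # (B, B) neighborhood structure before alignment
    s_post = a @ a.T   # (B, B) neighborhood structure after alignment
    # Penalize distortion of the neighborhood geometry.
    return ((s_pre - s_post) ** 2).mean()

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between two sets of activations, shapes (N, Dx), (N, Dy).

    One plausible instantiation of "representational similarity" for
    choosing which encoder layers to align across modalities.
    """
    x = x - x.mean(dim=0, keepdim=True)  # center each feature
    y = y - y.mean(dim=0, keepdim=True)
    # Linear CKA: ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = (x.T @ y).norm() ** 2
    den = (x.T @ x).norm() * (y.T @ y).norm()
    return num / den
```

Under these assumptions, the regularizer would be added to an existing alignment objective (e.g., a CLIP-style contrastive loss) as `loss = align_loss + lam * structure_regularizer(z_frozen, z_head)`, with the alignment head attached at the layer pair maximizing `linear_cka` between the two encoders rather than at their final layers.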