Multimodal large language models (MLLMs) have shown promising advances in general visual and language understanding. However, using MLLMs to represent multimodal information remains largely unexplored. In this work, we introduce a new framework, E5-V, designed to adapt MLLMs to produce universal multimodal embeddings. Our findings highlight the significant potential of MLLMs for representing multimodal inputs compared with previous approaches. By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs, demonstrating strong performance on multimodal embeddings even without fine-tuning. We propose a single-modality training approach for E5-V, in which the model is trained exclusively on text pairs. This approach yields significant improvements over traditional multimodal training on image-text pairs while reducing training costs by approximately 95%, and it eliminates the need for costly multimodal training data collection. Extensive experiments across four types of tasks demonstrate the effectiveness of E5-V. As a universal multimodal model, E5-V achieves, and often surpasses, state-of-the-art performance on each task despite being trained on a single modality.
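To make the prompt-based embedding idea concrete, here is a minimal sketch of how an MLLM can be turned into an embedder: both text and images are wrapped in a prompt that asks for a one-word summary, and the last-layer hidden state of the final prompt token is taken as the embedding, placing both modalities in the same representation space. The checkpoint name, chat template, and prompt wording below are illustrative assumptions, not the verified E5-V configuration.

```python
# A minimal sketch of prompt-based multimodal embedding with an MLLM, in the
# spirit of E5-V. Checkpoint, chat template, and prompt wording are assumptions.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

MODEL = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed MLLM backbone
processor = LlavaNextProcessor.from_pretrained(MODEL)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

@torch.no_grad()
def embed_text(sentence: str) -> torch.Tensor:
    # Ask the model to compress the sentence into one word, then take the
    # last-layer hidden state of the final token as the embedding.
    prompt = f"[INST] {sentence}\nSummary of the above sentence in one word: [/INST]"
    inputs = processor.tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][:, -1, :]  # shape: (1, hidden_dim)

@torch.no_grad()
def embed_image(image: Image.Image) -> torch.Tensor:
    # The analogous prompt on the image side maps images into the same
    # one-word "summary" space, which is what bridges the modality gap.
    prompt = "[INST] <image>\nSummary of the above image in one word: [/INST]"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][:, -1, :]
```

Because both modalities are forced through the same "one word" bottleneck, their embeddings can be compared directly with cosine similarity, even before any fine-tuning.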
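The single-modality training described above amounts to standard contrastive learning, except that both sides of each pair are sentences. Below is a hedged sketch of such an objective, assuming in-batch negatives and a temperature chosen purely for illustration.

```python
# A hedged sketch of the single-modality training objective: contrastive
# learning over text pairs with in-batch negatives (InfoNCE-style). The
# temperature value is an assumption for illustration.
import torch
import torch.nn.functional as F

def text_pair_contrastive_loss(anchors: torch.Tensor,
                               positives: torch.Tensor,
                               temperature: float = 0.05) -> torch.Tensor:
    """anchors, positives: (B, hidden_dim) embeddings of paired sentences.

    Each anchor's positive sits on the diagonal of the similarity matrix;
    every other positive in the batch serves as a negative, so no
    image-text data is required at training time.
    """
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    logits = anchors @ positives.T / temperature  # (B, B) cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

Since this objective touches only text, no images need to be encoded during training, which is consistent with the roughly 95% reduction in training cost reported above.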