Most multi-modal tasks can be formulated as either generation or embedding problems. Existing models usually tackle these two types of problems by decoupling language modules into a text decoder for generation and a text encoder for embedding. To explore the minimalism of multi-modal paradigms, we attempt to achieve only one model per modality in this work. We propose a Multi-Modal Generative Embedding Model (MM-GEM), in which the generative and embedding objectives are encapsulated in one Large Language Model. We also propose a PoolAggregator to boost efficiency and enable fine-grained embedding and generation. A surprising finding is that these two objectives do not significantly conflict with each other. For example, MM-GEM instantiated from ViT-Large and TinyLlama shows competitive performance on benchmarks for multi-modal embedding models, such as cross-modal retrieval and zero-shot classification, while retaining good image captioning ability. Additionally, MM-GEM can seamlessly execute region-level image caption generation and retrieval tasks. Moreover, the advanced text model in MM-GEM brings over 5% improvement in Recall@1 for long-text and image retrieval.
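To make the single-backbone idea concrete, here is a minimal PyTorch-style sketch of how one language model can serve both objectives: the same forward pass yields a next-token generative loss and a pooled, contrastively trainable embedding. This is not the authors' implementation; it assumes a Hugging Face-like decoder interface (`labels`, `output_hidden_states`, `out.loss`), and the names `GenerativeEmbeddingModel` and `embed_proj` are illustrative. The mean pooling here is only a stand-in for the paper's PoolAggregator, whose exact design is not specified in this abstract.

```python
import torch
import torch.nn.functional as F

class GenerativeEmbeddingModel(torch.nn.Module):
    """Sketch: one LLM backbone for both generation and embedding."""

    def __init__(self, backbone, hidden_dim, embed_dim):
        super().__init__()
        self.backbone = backbone                       # e.g. a TinyLlama-style decoder
        self.embed_proj = torch.nn.Linear(hidden_dim, embed_dim)

    def pool(self, hidden_states, attention_mask):
        # Mean pooling over valid tokens; an assumed stand-in for PoolAggregator.
        mask = attention_mask.unsqueeze(-1).float()    # (B, T, 1)
        return (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1)

    def forward(self, input_ids, attention_mask, labels=None):
        out = self.backbone(
            input_ids,
            attention_mask=attention_mask,
            labels=labels,                             # loss is None if labels are omitted
            output_hidden_states=True,
        )
        # Project and normalize pooled last-layer states into the embedding space.
        embedding = F.normalize(
            self.embed_proj(self.pool(out.hidden_states[-1], attention_mask)),
            dim=-1,
        )
        return out.loss, embedding                     # generative loss + embedding vector
```

At training time, the generative loss would be combined with an InfoNCE-style contrastive loss between pooled image and text embeddings; that pairing logic, and how the ViT-Large visual features enter the backbone, are omitted from this sketch.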