This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single- and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities, and the instruction-tuned model demonstrates competitive performance among open-source LMMs of similar size. In addition, we introduce a model safety-tuned with DPO, aiming to mitigate harmful behaviors such as hallucinations and to improve safety. We open-source our models, our curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.