This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single-image and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities, and the instruction-tuned model demonstrates competitive performance among open-source LMMs of similar size. In addition, we introduce a model safety-tuned with DPO (Direct Preference Optimization), aiming to mitigate harmful behaviors such as hallucinations and to improve safety. We open-source our models, our curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.