Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key contribution is a collection of new datasets called PixMo, including a dataset of highly detailed image captions for pre-training, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. The success of our approach relies on careful modeling choices, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets. Our best-in-class 72B model not only outperforms others in the class of open-weight and open-data models, but also outperforms larger proprietary models, including Claude 3.5 Sonnet and Gemini 1.5 Pro and Flash, ranking second only to GPT-4o on both academic benchmarks and a large human evaluation. Our model weights, new datasets, and source code are available at https://molmo.allenai.org/blog.