Today's most advanced vision-language models (VLMs) remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed VLMs into open ones. As a result, the community has been missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key contribution is a collection of new datasets called PixMo, including a dataset of highly detailed image captions for pre-training, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. The success of our approach relies on careful modeling choices, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets. Our best-in-class 72B model not only outperforms others in the class of open-weight and open-data models, but also outperforms larger proprietary models, including Claude 3.5 Sonnet and Gemini 1.5 Pro and Flash, ranking second only to GPT-4o on both academic benchmarks and a large human evaluation. Our model weights, new datasets, and source code are available at https://molmo.allenai.org/blog.