Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community still lacks foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image-caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of user interactions, we also introduce a diverse fine-tuning data mixture that includes in-the-wild Q&A and innovative 2D pointing data. The success of our approach relies on careful choices of model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. The best-in-class 72B model in the Molmo family not only outperforms other open-weight, open-data models but also compares favorably against proprietary systems such as GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation. We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future. Select model weights, inference code, and a demo are available at https://molmo.allenai.org.