Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on state-of-the-art video (and image) language models. Crucially, many downstream applications require more than high-level video understanding; they require grounding -- either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding across single-image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object-tracking dataset with complex queries, and a new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data that uses an efficient packing and message-tree encoding scheme, and show that bi-directional attention on vision tokens and a novel token-weighting strategy improve performance. Our best-in-class 8B model outperforms other open-weight, open-data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 significantly outperforms existing open-weight models such as Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models such as Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
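To make the bi-directional-attention idea concrete, the sketch below shows one common way to construct such a mask: text tokens keep standard causal attention while vision tokens are made mutually visible to each other. This is a minimal illustrative sketch only; the function name `build_attention_mask` and the simplification that all vision tokens attend to one another (rather than, say, only within each image's token span) are our assumptions, not details from the paper.

```python
import torch

def build_attention_mask(is_vision: torch.Tensor) -> torch.Tensor:
    """Build a boolean attention mask (True = position may attend).

    Text tokens use standard causal (lower-triangular) attention;
    vision tokens may additionally attend to every other vision token
    in both directions. Hypothetical sketch, not the paper's code.

    is_vision: (seq_len,) bool tensor marking vision-token positions.
    """
    seq_len = is_vision.shape[0]
    # Standard causal mask: each token sees itself and earlier tokens.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Vision-to-vision pairs become mutually visible (bi-directional).
    vision_pairs = is_vision.unsqueeze(0) & is_vision.unsqueeze(1)
    return causal | vision_pairs

# Example: 2 text tokens, 3 vision tokens, 2 text tokens.
is_vision = torch.tensor([False, False, True, True, True, False, False])
mask = build_attention_mask(is_vision)
# mask[2, 4] is True: an earlier vision token can attend to a later one,
# while text tokens (e.g. mask[0, 1]) remain strictly causal.
```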