MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Rock Yuren Pang, Diego Llanes, Yue Yang, Taira Anderson, Boyuan Zheng, Zhongzheng Ren, Harsh Trivedi, Taylor Blanton, Caleb Ouellette, Winson Han, Ali Farhadi, Ranjay Krishna

from arXiv, https://allenai.org/blog/molmoweb

Web agents, autonomous systems that navigate and execute tasks on the web on behalf of users, have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, which limits scientific understanding, reproducibility, and community-driven progress. We believe agents for the open web should be built in the open. To this end, we introduce (1) MolmoWebMix, a large and diverse mixture of browser task demonstrations and web-GUI perception data, and (2) MolmoWeb, a family of fully open multimodal web agents. Specifically, MolmoWebMix combines over 100K synthetic task trajectories from multiple complementary generation pipelines with 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data, including referring expression grounding and screenshot question answering. MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML, accessibility trees, or specialized APIs. MolmoWeb agents are available in 4B and 8B sizes. On browser-use benchmarks such as WebVoyager, Online-Mind2Web, and DeepShop, they achieve state-of-the-art results, outperforming similar-scale open-weight-only models such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B. MolmoWeb-8B also surpasses set-of-marks (SoM) agents built on much larger closed frontier models like GPT-4o. We further demonstrate consistent gains through test-time scaling via parallel rollouts with best-of-N selection, achieving 94.7% pass@4 on WebVoyager and 60.5% pass@4 on Online-Mind2Web (compared to 78.2% and 35.3% pass@1, respectively). We will release model checkpoints, training data, code, and a unified evaluation harness to enable reproducibility and accelerate open research on web agents.
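
To make the screenshot-only control loop and the pass@k metric concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the MolmoWeb implementation: `Action`, `Browser`, `Policy`, `run_episode`, and `pass_at_k` are hypothetical names, the policy is assumed to see only the instruction and the current screenshot, and pass@k is assumed to count a task as solved when at least one of k independent rollouts succeeds. The abstract does not specify the action format or how best-of-N selection picks among rollouts.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


@dataclass
class Action:
    """Hypothetical browser action, e.g. kind="click" or kind="stop"."""
    kind: str
    argument: str = ""


class Browser(Protocol):
    """Hypothetical browser interface: the policy only ever receives pixels."""

    def screenshot(self) -> bytes: ...

    def execute(self, action: Action) -> None: ...


# Assumption: a policy maps (task instruction, screenshot) to the next action.
Policy = Callable[[str, bytes], Action]


def run_episode(policy: Policy, browser: Browser, instruction: str,
                max_steps: int = 30) -> List[Action]:
    """Roll out one trajectory; no HTML or accessibility tree is consulted."""
    trajectory: List[Action] = []
    for _ in range(max_steps):
        action = policy(instruction, browser.screenshot())
        trajectory.append(action)
        if action.kind == "stop":  # the policy signals that it is done
            break
        browser.execute(action)
    return trajectory


def pass_at_k(outcomes: List[List[bool]], k: int) -> float:
    """Fraction of tasks where at least one of the first k rollouts succeeded.

    outcomes[i][j] is True if rollout j of task i was judged successful.
    """
    if not outcomes:
        raise ValueError("outcomes must contain at least one task")
    solved = sum(1 for rollouts in outcomes if any(rollouts[:k]))
    return solved / len(outcomes)
```

Under this definition, pass@k with k = 1 reduces to the single-rollout success rate, and the value can only stay the same or grow as k increases, which is why running several rollouts per task in parallel can raise the reported score.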
