ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

Modern computer-use agents (CUAs) must perceive a screen as a structured state (which elements are visible, where they are, and what text they contain) before they can reliably ground instructions and act. Yet most available grounding datasets provide only sparse supervision: they annotate a small, low-diversity subset of task-relevant elements per screen, which limits both coverage and generalization. Practical deployment additionally demands models efficient enough for low-latency, on-device use. We introduce ScreenParse, a large-scale dataset for complete screen parsing, with dense annotations of all visible UI elements (bounding boxes, 55-class element types, and text) across 771K web screenshots (21M elements). ScreenParse is generated by Webshot, an automated, scalable pipeline that renders diverse URLs, extracts annotations, and applies VLM-based relabeling and quality filtering. Using ScreenParse, we train ScreenVLM, a 316M-parameter vision-language model (VLM) that decodes a compact ScreenTag markup representation and is trained with a structure-aware loss that upweights structure-critical tokens. ScreenVLM substantially outperforms much larger foundation VLMs on dense parsing (e.g., 0.592 vs. 0.294 PageIoU on ScreenParse) and transfers well to public benchmarks. Moreover, fine-tuning foundation VLMs on ScreenParse consistently improves their grounding performance, suggesting that dense screen supervision provides transferable structural priors for UI understanding. Project page: https://saidgurbuz.github.io/screenparse/.
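
The abstract does not give the exact form of the structure-aware loss. As a minimal sketch of the general idea, assuming it can be approximated by a token-weighted cross-entropy, the function below scales the negative log-likelihood of each target token by a larger weight when that token belongs to a set of structure-critical token ids. The names `weighted_token_loss`, `structural_ids`, and `structural_weight`, as well as the weight value, are hypothetical and are not taken from ScreenVLM.

```python
import math


def weighted_token_loss(logits, targets, structural_ids, structural_weight=2.0):
    """Weighted mean cross-entropy over one token sequence.

    logits: list of per-position lists of unnormalized vocabulary scores
    targets: list of target token ids, one per position
    structural_ids: set of token ids treated as structure-critical
                    (hypothetical; e.g. ids of markup tag tokens)
    structural_weight: weight for structure-critical targets (assumed value)
    """
    total = 0.0
    weight_sum = 0.0
    for scores, target in zip(logits, targets):
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        nll = log_z - scores[target]
        # Upweight positions whose target is a structure-critical token.
        w = structural_weight if target in structural_ids else 1.0
        total += w * nll
        weight_sum += w
    # Normalize by the total weight so the loss scale stays comparable.
    return total / weight_sum


# Toy usage: vocabulary of 2 tokens, token id 0 plays the role of a tag token.
toy_logits = [[0.0, 0.0], [math.log(3.0), 0.0]]
toy_targets = [0, 1]
loss = weighted_token_loss(toy_logits, toy_targets, structural_ids={0}, structural_weight=3.0)
```

Dividing by the summed weights rather than the sequence length is one design choice among several; it keeps the loss magnitude stable when the proportion of structural tokens varies between screens.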
