Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos at web scale, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries and context, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text with high fidelity. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps automatically. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.
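The two-component inverse dynamics module described above can be sketched as a simple two-stage pipeline: a grounding model proposes action segments with temporal boundaries, and a content recognizer turns each segment into a structured action. This is a minimal illustrative sketch with stubbed model calls; all type and function names (`ActionSegment`, `ground_actions`, `recognize_content`, etc.) are assumptions, not the paper's actual interfaces.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical types for the Video2Action stages; names are assumptions.

@dataclass
class ActionSegment:
    start_s: float    # temporal boundary: segment start (seconds)
    end_s: float      # temporal boundary: segment end (seconds)
    action_type: str  # e.g. "click", "type"

@dataclass
class ParsedAction:
    action_type: str
    x: Optional[int] = None     # click coordinates, if applicable
    y: Optional[int] = None
    text: Optional[str] = None  # typed text, if applicable

def ground_actions(video_path: str) -> list[ActionSegment]:
    """Stage 1 (stub): detect GUI actions and localize their temporal boundaries."""
    # A real grounding model would run over video frames; we return fixed segments.
    return [ActionSegment(3.2, 3.5, "click"),
            ActionSegment(5.0, 7.1, "type")]

def recognize_content(video_path: str, seg: ActionSegment) -> ParsedAction:
    """Stage 2 (stub): extract structured parameters for one detected segment."""
    if seg.action_type == "click":
        return ParsedAction("click", x=412, y=188)
    return ParsedAction("type", text="hello world")

def video_to_actions(video_path: str) -> list[ParsedAction]:
    """Compose the two stages into one pseudo-labeled interaction trajectory."""
    return [recognize_content(video_path, seg)
            for seg in ground_actions(video_path)]
```

Run at scale over tutorial videos, trajectories produced this way would serve as the pseudo-labels for continued pretraining before supervised fine-tuning.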