Agents operating in complex software environments benefit from reasoning about the consequences of their actions, as even a single incorrect user interface (UI) operation can derail long, artifact-preserving workflows. This challenge is particularly acute for computer-using scenarios, where real execution does not support counterfactual exploration, making large-scale trial-and-error learning and planning impractical despite the environment being fully digital and deterministic. We introduce the Computer-Using World Model (CUWM), a world model for desktop software that predicts the next UI state given the current state and a candidate action. CUWM adopts a two-stage factorization of UI dynamics: it first predicts a textual description of agent-relevant state changes, and then realizes these changes visually to synthesize the next screenshot. CUWM is trained on offline UI transitions collected from agents interacting with real Microsoft Office applications, and further refined with a lightweight reinforcement learning stage that aligns textual transition predictions with the structural requirements of computer-using environments. We evaluate CUWM via test-time action search, where a frozen agent uses the world model to simulate and compare candidate actions before execution. Across a range of Office tasks, world-model-guided test-time scaling improves decision quality and execution robustness.
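The two-stage prediction and the test-time action search described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The names `predict_next_state`, `select_action`, `describe_change`, `render_change`, and `score` are hypothetical. Plain strings stand in for screenshots, and the toy functions stand in for the learned models and the frozen agent's evaluation of a simulated outcome.

```python
from typing import Callable, Sequence, Tuple

# Assumption: plain strings stand in for screenshots and UI actions.
State = str
Action = str


def predict_next_state(
    state: State,
    action: Action,
    describe_change: Callable[[State, Action], str],
    render_change: Callable[[State, str], State],
) -> Tuple[str, State]:
    """Two-stage prediction: textual change first, then the next state."""
    change = describe_change(state, action)      # stage 1: describe the change
    next_state = render_change(state, change)    # stage 2: realize it "visually"
    return change, next_state


def select_action(
    state: State,
    candidates: Sequence[Action],
    describe_change: Callable[[State, Action], str],
    render_change: Callable[[State, str], State],
    score: Callable[[State], float],
) -> Action:
    """Simulate every candidate action and return the best-scoring one."""
    if not candidates:
        raise ValueError("at least one candidate action is required")
    best_action, best_score = candidates[0], float("-inf")
    for action in candidates:
        _, predicted = predict_next_state(
            state, action, describe_change, render_change
        )
        value = score(predicted)
        if value > best_score:
            best_action, best_score = action, value
    return best_action


# Hypothetical stand-ins for the learned components.
def toy_describe(state: State, action: Action) -> str:
    if action == "click Bold":
        return "selected text becomes bold"
    if action == "click Close":
        return "document window closes"
    return "no visible change"


def toy_render(state: State, change: str) -> State:
    return f"{state} | {change}"


def toy_score(predicted: State) -> float:
    # Penalize losing the document, reward the intended formatting change.
    if "closes" in predicted:
        return -1.0
    if "bold" in predicted:
        return 1.0
    return 0.0


chosen = select_action(
    "Word: title selected",
    ["click Close", "click Bold", "press Escape"],
    toy_describe,
    toy_render,
    toy_score,
)
```

In this sketch no candidate action is executed against the real application. Only the action chosen after simulation would be carried out, which is the property that makes the search usable where real execution cannot be rolled back.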