We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. Experiments show that OpenComputer's hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state. Frontier agents struggle with end-to-end completion despite partial progress, and open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.
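To make the idea of auditable partial-credit rewards concrete, the following is a minimal sketch, not OpenComputer's actual implementation. It assumes a verifier returns application state as a plain dictionary and that a task is scored as the weighted fraction of state checks that pass. The names `Check`, `score`, the state keys, and the weights are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical: a verifier's inspection endpoint is assumed to return
# application state as a plain dictionary.
State = Dict[str, Any]


@dataclass(frozen=True)
class Check:
    """One machine-checkable condition on the inspected application state."""
    name: str
    predicate: Callable[[State], bool]
    weight: float = 1.0


def score(state: State, checks: List[Check]) -> Tuple[float, Dict[str, bool]]:
    """Return (reward in [0, 1], per-check audit record).

    The reward is the weighted fraction of checks that pass. The audit
    record keeps every individual outcome so the score can be inspected.
    """
    total = sum(c.weight for c in checks)
    if total <= 0:
        raise ValueError("checks must carry positive total weight")
    results = {c.name: bool(c.predicate(state)) for c in checks}
    earned = sum(c.weight for c in checks if results[c.name])
    return earned / total, results


# Illustrative state after an agent opened the right file but neither
# changed the font size nor saved the document.
state = {"open_file": "report.odt", "font_size": 12, "saved": False}
checks = [
    Check("file_open", lambda s: s.get("open_file") == "report.odt"),
    Check("font_size_14", lambda s: s.get("font_size") == 14),
    Check("document_saved", lambda s: s.get("saved") is True, weight=2.0),
]
reward, audit = score(state, checks)
print(reward, audit)
```

Under these assumptions the agent earns credit only for the one check it satisfied, and the audit record shows which conditions failed. This is the kind of fine-grained state dependence that the abstract says hard-coded verifiers capture better than LLM-as-judge evaluation.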