Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment, where personal assistants are expected to work across a user's whole digital life, including their context, historical data, and logged-in accounts. The gap is widest on web tasks: live web evaluations cannot exercise sites that require a login or personal information, which are exactly the sites a real personal assistant has to drive. We introduce MyPCBench, which tests computer-use agents as personal assistants on a Linux desktop populated with 17 simulated real-world web applications and a full desktop stack, all seeded for one canonical persona, Michael Scott from The Office. We define 184 tasks in this environment, each inspired by a real request drawn from the OpenClaw community, and benchmark six closed and open-weight models with a uniform computer+bash tool surface. The best model, Claude Opus 4.6, fully solves 55.4\% of the tasks and is the only model above 50\%. Model failures cluster on tasks that span many applications and on long trajectories, where personalization stresses an assistant the most. We release the environment, task set, and agent harness at https://mypcbench.com.