Computer-use agents (CUAs) that interact with real computer systems can perform automated tasks but face critical safety risks. Ambiguous instructions may trigger harmful actions, and adversarial users can manipulate tool execution to achieve malicious goals. Existing benchmarks mostly focus on short-horizon or GUI-based tasks, evaluating execution-time errors while overlooking the ability to anticipate planning-time risks. To fill this gap, we present LPS-Bench, a benchmark that evaluates the planning-time safety awareness of MCP-based CUAs on long-horizon tasks, covering both benign and adversarial interactions across 65 scenarios spanning 7 task domains and 9 risk types. We introduce a multi-agent automated pipeline for scalable data generation and adopt an LLM-as-a-judge evaluation protocol to assess safety awareness throughout the planning trajectory. Experiments reveal substantial deficiencies in existing CUAs' ability to maintain safe behavior. We further analyze the risks and propose mitigation strategies to improve long-horizon planning safety in MCP-based CUA systems. We open-source our code at https://github.com/tychenn/LPS-Bench.