Although computer-use agents (CUAs) hold significant potential to automate increasingly complex OS workflows, they can demonstrate unsafe unintended behaviors that deviate from expected outcomes even under benign input contexts. However, exploration of this risk remains largely anecdotal, lacking concrete characterization and automated methods to proactively surface long-tail unintended behaviors under realistic CUA scenarios. To fill this gap, we introduce the first conceptual and methodological framework for unintended CUA behaviors, by defining their key characteristics, automatically eliciting them, and analyzing how they arise from benign inputs. We propose AutoElicit: an agentic framework that iteratively perturbs benign instructions using CUA execution feedback, and elicits severe harms while keeping perturbations realistic and benign. Using AutoElicit, we surface hundreds of harmful unintended behaviors from state-of-the-art CUAs such as Claude 4.5 Haiku, Claude 4.5 Opus, and Operator. We further evaluate the transferability of human-verified successful perturbations, identifying persistent susceptibility to unintended behaviors across various other frontier CUAs. This work establishes a foundation for systematically analyzing unintended behaviors in realistic computer-use settings.
翻译:尽管计算机使用智能体(CUA)在自动化日益复杂的操作系统工作流方面具有巨大潜力,但在良性输入环境下,它们可能表现出偏离预期结果的不安全意外行为。然而,对此类风险的探索目前主要基于零散案例,缺乏具体表征和自动化方法来主动揭示现实CUA场景中长尾分布的意外行为。为填补这一空白,我们首次提出了面向CUA意外行为的概念与方法论框架,通过定义其关键特征、自动诱发机制以及分析良性输入如何引发这些行为。我们提出AutoElicit:一种利用CUA执行反馈迭代扰动良性指令、在保持扰动真实性与良性的同时诱发严重危害的智能体框架。通过AutoElicit,我们从Claude 4.5 Haiku、Claude 4.5 Opus和Operator等最先进CUA中发现了数百种有害意外行为。我们进一步评估了经人工验证的成功扰动在不同前沿CUA间的可迁移性,揭示了各模型对意外行为的持续易感性。本研究为在真实计算机使用场景中系统分析意外行为奠定了基础。