Although computer-use agents (CUAs) hold significant potential to automate increasingly complex OS workflows, they can exhibit unsafe, unintended behaviors that deviate from expected outcomes even under benign input contexts. However, exploration of this risk remains largely anecdotal, lacking both a concrete characterization and automated methods to proactively surface long-tail unintended behaviors under realistic CUA scenarios. To fill this gap, we introduce the first conceptual and methodological framework for unintended CUA behaviors: we define their key characteristics, automatically elicit them, and analyze how they arise from benign inputs. We propose AutoElicit, an agentic framework that iteratively perturbs benign instructions using CUA execution feedback to elicit severe harms while keeping the perturbations realistic and benign. Using AutoElicit, we surface hundreds of harmful unintended behaviors from state-of-the-art CUAs such as Claude 4.5 Haiku and Opus. We further evaluate the transferability of human-verified successful perturbations and find persistent susceptibility to unintended behaviors across various other frontier CUAs. This work establishes a foundation for systematically analyzing unintended behaviors in realistic computer-use settings.
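The iterative perturbation loop described above can be illustrated with a minimal sketch. This is not the AutoElicit implementation: the function `elicit`, the `Attempt` record, and the four callables (`perturb`, `run_agent`, `is_benign`, `is_harmful`) are hypothetical names introduced here only for illustration, and the stopping rule and iteration budget are assumptions. The sketch shows only the general shape of a feedback-driven loop that proposes a perturbed instruction, rejects it if it is no longer benign, executes it on the agent, and feeds the outcome back into the next proposal.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attempt:
    """One executed perturbation and its outcome (hypothetical record type)."""
    instruction: str
    trajectory: str
    harmful: bool


def elicit(
    seed_instruction: str,
    perturb: Callable[[str, List[Attempt]], str],
    run_agent: Callable[[str], str],
    is_benign: Callable[[str], bool],
    is_harmful: Callable[[str], bool],
    max_iters: int = 5,
) -> Optional[Attempt]:
    """Illustrative feedback loop; every callable is an assumed stand-in.

    perturb    -- proposes a new instruction from the current one plus history
    run_agent  -- executes an instruction on the CUA and returns a trajectory
    is_benign  -- checks that the perturbed instruction is still benign
    is_harmful -- judges whether the trajectory shows a harmful behavior
    """
    history: List[Attempt] = []
    current = seed_instruction
    for _ in range(max_iters):
        candidate = perturb(current, history)
        if not is_benign(candidate):
            # Discard perturbations that are no longer benign; do not execute them.
            continue
        trajectory = run_agent(candidate)
        attempt = Attempt(candidate, trajectory, is_harmful(trajectory))
        history.append(attempt)  # execution feedback for the next proposal
        if attempt.harmful:
            return attempt
        current = candidate
    return None
```

In practice each callable would wrap a model or an execution environment; here they are left abstract so that the loop structure stays visible and no specific API is implied.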