Recent years have seen rapid development of mobile GUI agents powered by large language models (LLMs), which can autonomously execute diverse device-control tasks from natural language instructions. The increasing accuracy of these agents on standard benchmarks has raised expectations for large-scale real-world deployment, and several commercial agents have already been released and are in use by early adopters. However, are we really ready for GUI agents to be integrated into our daily devices as system building blocks? We argue that an important pre-deployment validation is missing: examining whether agents can maintain their performance under real-world threats. Specifically, existing common benchmarks are built on simple, static app content (which they need in order to keep the environment consistent across tests), whereas real-world apps are filled with content from untrustworthy third parties, such as advertising emails and user-generated posts and media. ... To this end, we introduce a scalable app content instrumentation framework that enables flexible and targeted content modifications within existing applications. Leveraging this framework, we create a test suite comprising both a dynamic task execution environment and a static dataset of challenging GUI states. The dynamic environment encompasses 122 reproducible tasks, and the static dataset consists of over 3,000 scenarios constructed from commercial apps. We perform experiments on both open-source and commercial GUI agents. Our findings reveal that the performance of all examined agents can be significantly degraded by third-party content, with average misleading rates of 42.0% and 36.1% in the dynamic and static environments, respectively. The framework and benchmark have been released at https://agenthazard.github.io.