Large language models (LLMs) have empowered intelligent agents to execute intricate tasks within domain-specific software such as browsers and games. However, when applied to general-purpose software systems like operating systems, LLM agents face three primary challenges. First, the action space is vast and dynamic, making it difficult for LLM agents to maintain an up-to-date understanding and deliver accurate responses. Second, real-world tasks often require inter-application cooperation, demanding farsighted planning from LLM agents. Third, agents need to identify optimal solutions that align with user constraints, such as security concerns and preferences. These challenges motivate AndroidArena, an environment and benchmark designed to evaluate LLM agents on a modern operating system. To reduce the high cost of manual annotation, we design a scalable and semi-automated method to construct the benchmark. For task evaluation, AndroidArena incorporates accurate and adaptive metrics to address the issue of non-unique solutions. Our findings reveal that even state-of-the-art LLM agents struggle in cross-APP scenarios and in adhering to specific constraints. Additionally, we identify a lack of four key capabilities, i.e., understanding, reasoning, exploration, and reflection, as the primary reasons for the failure of LLM agents. Furthermore, we provide an empirical analysis of reflection failures and improve the success rate by 27% with our proposed exploration strategy. This work is the first to present valuable insights into the fine-grained weaknesses of LLM agents, and offers a path forward for future research in this area. The environment, benchmark, and evaluation code for AndroidArena are released at https://github.com/AndroidArenaAgent/AndroidArena.