Large language models (LLMs) have empowered intelligent agents to execute intricate tasks within domain-specific software such as browsers and games. However, when applied to general-purpose software systems like operating systems, LLM agents face three primary challenges. First, the action space is vast and dynamic, making it difficult for LLM agents to maintain an up-to-date understanding and deliver accurate responses. Second, real-world tasks often require inter-application cooperation, demanding farsighted planning from LLM agents. Third, agents need to identify optimal solutions that align with user constraints, such as security concerns and preferences. These challenges motivate AndroidArena, an environment and benchmark designed to evaluate LLM agents on a modern operating system. To reduce the high cost of manual annotation, we design a scalable and semi-automated method to construct the benchmark. For task evaluation, AndroidArena incorporates accurate and adaptive metrics to address the issue of non-unique solutions. Our findings reveal that even state-of-the-art LLM agents struggle in cross-APP scenarios and in adhering to specific constraints. Additionally, we identify a lack of four key capabilities, i.e., understanding, reasoning, exploration, and reflection, as the primary reasons for the failure of LLM agents. Furthermore, we provide an empirical analysis of reflection failures and improve the success rate by 27% with our proposed exploration strategy. This work is the first to present valuable insights into the fine-grained weaknesses of LLM agents, and offers a path forward for future research in this area. The environment, benchmark, and evaluation code for AndroidArena are released at https://github.com/AndroidArenaAgent/AndroidArena.