GUIs have long been central to human-computer interaction, providing an intuitive, visually driven way to access and operate digital systems. The advent of LLMs, particularly multimodal models, has ushered in a new era of GUI automation. These models have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing. This has paved the way for a new generation of LLM-brained GUI agents capable of interpreting complex GUI elements and autonomously executing actions based on natural language instructions. These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands. Their applications span web navigation, mobile app interactions, and desktop automation, offering a transformative user experience that revolutionizes how individuals interact with software. This emerging field is advancing rapidly, with significant progress in both research and industry. To provide a structured understanding of this trend, this paper presents a comprehensive survey of LLM-brained GUI agents, exploring their historical evolution, core components, and advanced techniques. We address key research questions, including existing GUI agent frameworks, the collection and utilization of data for training specialized GUI agents, the development of large action models tailored for GUI tasks, and the evaluation metrics and benchmarks needed to assess their effectiveness. Additionally, we examine emerging applications powered by these agents. Through a detailed analysis, this survey identifies key research gaps and outlines a roadmap for future advancements in the field. By consolidating foundational knowledge and state-of-the-art developments, this work aims to guide both researchers and practitioners in overcoming challenges and unlocking the full potential of LLM-brained GUI agents.