GUIs have long been central to human-computer interaction, providing an intuitive and visually driven way to access and interact with digital systems. The advent of LLMs, particularly multimodal models, has ushered in a new era of GUI automation. These models have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing, paving the way for a new generation of LLM-brained GUI agents capable of interpreting complex GUI elements and autonomously executing actions based on natural language instructions. These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands. Their applications span web navigation, mobile app interaction, and desktop automation, offering a transformative user experience that revolutionizes how individuals interact with software. This emerging field is rapidly advancing, with significant progress in both research and industry. To provide a structured understanding of this trend, this paper presents a comprehensive survey of LLM-brained GUI agents, exploring their historical evolution, core components, and advanced techniques. We address key research questions, including existing GUI agent frameworks, the collection and utilization of data for training specialized GUI agents, the development of large action models tailored for GUI tasks, and the evaluation metrics and benchmarks necessary to assess their effectiveness. Additionally, we examine emerging applications powered by these agents. Through a detailed analysis, this survey identifies key research gaps and outlines a roadmap for future advancements in the field. By consolidating foundational knowledge and state-of-the-art developments, this work aims to guide both researchers and practitioners in overcoming challenges and unlocking the full potential of LLM-brained GUI agents.