Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. These agents autonomously interact with digital systems or software applications via GUIs, emulating human actions such as clicking, typing, and navigating visual elements across diverse platforms. Motivated by the growing interest in GUI agents and their fundamental importance, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods. We propose a unified framework that delineates their perception, reasoning, planning, and acting capabilities. Furthermore, we identify important open challenges and discuss key future directions. This work is intended to serve as a basis for practitioners and researchers to gain an intuitive understanding of current progress, techniques, benchmarks, and the critical open problems that remain to be addressed.