Autonomous graphical user interface (GUI) agents aim to facilitate task automation by interacting with the user interface without manual intervention. Recent studies have investigated eliciting the capabilities of large language models (LLMs) for effective engagement in diverse environments. To align with the input-output requirements of LLMs, most existing approaches are developed under a sandbox setting where they rely on external tools and application-specific APIs to parse the environment into textual elements and interpret the predicted actions. Consequently, these approaches often grapple with inference inefficiency and error propagation risks. To mitigate these challenges, we introduce Auto-GUI, a multimodal solution that directly interacts with the interface, bypassing the need for environment parsing or reliance on application-dependent APIs. Moreover, we propose a chain-of-action technique -- leveraging a series of intermediate previous action histories and future action plans -- to help the agent decide what action to execute. We evaluate our approach on a new device-control benchmark, AITW, with 30$K$ unique instructions, spanning multi-step tasks such as application operation, web searching, and web shopping. Experimental results show that Auto-GUI achieves state-of-the-art performance with an action type prediction accuracy of 90\% and an overall action success rate of 74\%. Code is publicly available at https://github.com/cooelf/Auto-GUI.
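The chain-of-action idea -- conditioning each decision on the history of previously executed actions and a freshly generated plan of future actions -- can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual implementation: `ChainOfActionAgent` and its `predict_plan` stub are invented names, and the stub stands in for the multimodal model's prediction.

```python
# Hypothetical sketch of a chain-of-action decision loop.
# At each step the agent (1) generates a plan of future actions
# conditioned on the goal, current screen, and prior action history,
# (2) executes the first planned action, and (3) appends it to the
# history that conditions the next prediction.

from dataclasses import dataclass, field


@dataclass
class ChainOfActionAgent:
    goal: str
    history: list = field(default_factory=list)  # previous action chain

    def predict_plan(self, screen: str) -> list:
        # Stand-in for the model call: given goal, screen, and history,
        # return a plan (an ordered list of future actions).
        if "search box" in screen and not self.history:
            return ["click search box", "type query", "press enter"]
        return ["press enter"]

    def step(self, screen: str) -> str:
        plan = self.predict_plan(screen)  # future action plan
        action = plan[0]                  # execute only the first planned action
        self.history.append(action)       # extend the previous-action chain
        return action


agent = ChainOfActionAgent(goal="search for weather")
first = agent.step("home screen with search box")
# The history now conditions the next call to predict_plan.
```

The key design point mirrored here is that the plan is regenerated at every step, so the agent can revise its future actions as the interface changes, while the executed-action history accumulates monotonically.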