Autonomous user interface (UI) agents aim to automate tasks by interacting with the user interface without manual intervention. Recent studies have investigated how to elicit the capabilities of large language models (LLMs) for effective engagement in diverse environments. To meet the input-output requirements of LLMs, existing approaches are developed in a sandbox setting, where they rely on external tools and application-specific APIs to parse the environment into textual elements and to interpret the predicted actions. Consequently, these approaches often suffer from inefficient inference and the risk of error propagation. To mitigate these challenges, we introduce Auto-UI, a multimodal solution that interacts directly with the interface, bypassing the need for environment parsing or reliance on application-dependent APIs. Moreover, we propose a chain-of-action technique, which leverages a series of intermediate previous action histories and future action plans, to help the agent decide what action to execute. We evaluate our approach on AITW, a new device-control benchmark with 30K unique instructions spanning multi-step tasks such as application operation, web searching, and web shopping. Experimental results show that Auto-UI achieves state-of-the-art performance, with an action type prediction accuracy of 90% and an overall action success rate of 74%. Code is publicly available at https://github.com/cooelf/Auto-UI.
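To make the chain-of-action idea more concrete, the following is a minimal sketch of how previous action histories and future action plans might be serialized into a single text context for an agent. It is not the Auto-UI implementation: the `Action` class, its field names, the `build_chain_of_action_prompt` helper, and the text layout are all hypothetical and chosen only for illustration. The abstract does not state whether the future plan is supplied as input or generated by the model together with the action; the sketch simply places both parts in one context.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    """A hypothetical UI action; the field names are illustrative assumptions."""
    action_type: str      # e.g. "click", "type", "scroll"
    argument: str = ""    # e.g. typed text or a coarse target description

    def render(self) -> str:
        # Render the action as a compact call-like string.
        return f"{self.action_type}({self.argument})"


def render_chain(actions: List[Action]) -> str:
    """Join actions into one line, or return "none" for an empty chain."""
    return "; ".join(a.render() for a in actions) if actions else "none"


def build_chain_of_action_prompt(
    goal: str,
    history: List[Action],
    plan: List[Action],
    max_history: int = 8,
) -> str:
    """Assemble goal, recent action history, and future plan into one text.

    Only the most recent `max_history` actions are kept so that the context
    stays bounded on long multi-step tasks (an assumption of this sketch).
    """
    recent = history[-max_history:] if max_history > 0 else []
    return "\n".join(
        [
            f"Goal: {goal}",
            f"Previous actions: {render_chain(recent)}",
            f"Future action plan: {render_chain(plan)}",
            "Next action:",
        ]
    )


if __name__ == "__main__":
    history = [Action("click", "search bar"), Action("type", "weather")]
    plan = [Action("press_enter"), Action("status_complete")]
    print(build_chain_of_action_prompt("Check the weather", history, plan))
```

Truncating the history to a fixed window is one possible design choice for bounding the context length; other choices, such as summarizing older actions, would fit the same interface.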