Large Language Model (LLM)-based UI agents show great promise for UI automation but often hallucinate in long-horizon tasks because they lack an understanding of the global UI transition structure. To address this, we introduce AGENT+P, a novel framework that leverages symbolic planning to guide LLM-based UI agents. Specifically, we model an app's UI transition structure as a UI Transition Graph (UTG), which allows us to reformulate the UI automation task as a pathfinding problem on the UTG. This in turn enables an off-the-shelf symbolic planner to generate a provably correct and optimal high-level plan, which spares the agent redundant exploration and guides it toward the automation goal. AGENT+P is designed as a plug-and-play framework to enhance existing UI agents. Evaluation on the AndroidWorld benchmark demonstrates that AGENT+P improves the success rates of state-of-the-art UI agents by up to 14.31% and reduces the number of action steps by 37.70%.
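To make the pathfinding reformulation concrete, the following is a minimal sketch and not the AGENT+P implementation. It assumes a UTG can be represented as a mapping from screens to action-labeled transitions, and it substitutes a plain breadth-first search for the off-the-shelf symbolic planner named in the abstract. The screen names, action names, and the `shortest_plan` helper are all hypothetical.

```python
from collections import deque

# Hypothetical UTG: screen -> {action: next_screen}.
# Screen and action names are illustrative, not taken from AndroidWorld.
UTG = {
    "Home": {"tap_settings": "Settings", "tap_contacts": "Contacts"},
    "Settings": {"tap_wifi": "WifiSettings", "back": "Home"},
    "Contacts": {"tap_add": "NewContact", "back": "Home"},
    "WifiSettings": {"back": "Settings"},
    "NewContact": {"back": "Contacts"},
}


def shortest_plan(utg, start, goal):
    """Return a shortest list of (action, resulting_screen) steps.

    Breadth-first search stands in for a symbolic planner here.
    Returns [] if start == goal and None if the goal is unreachable.
    """
    queue = deque([start])
    parent = {start: None}  # screen -> (previous_screen, action) or None
    while queue:
        screen = queue.popleft()
        if screen == goal:
            # Walk the parent links back to the start, then reverse.
            plan = []
            while parent[screen] is not None:
                prev, action = parent[screen]
                plan.append((action, screen))
                screen = prev
            return plan[::-1]
        for action, nxt in utg.get(screen, {}).items():
            if nxt not in parent:
                parent[nxt] = (screen, action)
                queue.append(nxt)
    return None


plan = shortest_plan(UTG, "Home", "WifiSettings")
for action, screen in plan:
    print(f"{action} -> {screen}")
```

On an unweighted graph, breadth-first search returns a path with the fewest transitions, which is the sense of "optimal" this sketch illustrates. An agent could then treat each step of such a plan as a high-level subgoal rather than exploring the app blindly.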