The advent of large language models (LLMs) has spurred considerable interest in autonomous LLM-based agents, particularly for applications in smartphone graphical user interfaces (GUIs). When presented with a task goal, these agents typically emulate human actions within a GUI environment until the task is completed. Although planning has been widely recognized as effective for decomposing complex tasks into a series of steps, a key challenge lies in devising effective plans to guide action prediction in GUI tasks. Specifically, because the GUI environment changes after each action is executed, it is crucial to adapt plans dynamically based on environmental feedback and action history. We show that the widely used ReAct approach fails because of its excessively long historical dialogues. To address this challenge, we propose a novel approach called Dynamic Planning of Thoughts (D-PoT) for LLM-based GUI agents. D-PoT dynamically adjusts its plan based on environmental feedback and execution history. Experimental results show that D-PoT significantly surpasses the strong GPT-4V baseline by 12.7 percentage points in accuracy (34.66% $\rightarrow$ 47.36%). Further analysis highlights the generality of dynamic planning across different backbone LLMs, as well as its benefits in mitigating hallucinations and adapting to unseen tasks. Code is available at https://github.com/sqzhang-lazy/D-PoT.
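To make the idea of dynamic planning concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a planner that is called again at every step with the current screen and the list of executed actions, so the plan can change when the environment does. All names here (`run_dynamic_planning`, `toy_planner`, `toy_execute`, the screen strings) are hypothetical stand-ins; in a real agent the planner would be an LLM call and the executor would drive a device.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical signatures: a planner maps (goal, screen, history) to a plan,
# and an executor maps an action to the resulting screen description.
Planner = Callable[[str, str, List[str]], List[str]]
Executor = Callable[[str], str]


@dataclass
class AgentState:
    goal: str
    plan: List[str] = field(default_factory=list)
    history: List[str] = field(default_factory=list)  # actions executed so far


def run_dynamic_planning(goal: str, first_screen: str, planner: Planner,
                         execute: Executor, max_steps: int = 10) -> AgentState:
    """Re-plan at every step from the current screen and the action history."""
    state = AgentState(goal)
    screen = first_screen
    for _ in range(max_steps):
        # The plan is regenerated each step instead of being fixed up front.
        state.plan = planner(goal, screen, state.history)
        if not state.plan:  # an empty plan means nothing is left to do
            break
        action = state.plan[0]
        screen = execute(action)  # environmental feedback for the next step
        state.history.append(action)
    return state


def toy_planner(goal: str, screen: str, history: List[str]) -> List[str]:
    """Toy stand-in for an LLM planner (assumption, for illustration only)."""
    steps = ["open settings", "tap wifi", "toggle wifi"]
    remaining = [s for s in steps if s not in history]
    if screen == "popup":
        # Feedback-driven adjustment: handle the unexpected popup first.
        return ["dismiss popup"] + remaining
    return remaining


def toy_execute(action: str) -> str:
    """Toy environment: opening settings triggers an unexpected popup."""
    return "popup" if action == "open settings" else "normal"


final_state = run_dynamic_planning("turn on wifi", "home", toy_planner, toy_execute)
print(final_state.history)
```

In this toy run the popup is not part of the initial plan; the planner inserts a "dismiss popup" step only after observing it. This is the kind of feedback-driven adjustment the abstract describes, shown here in a deliberately simplified form.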