Large language models (LLMs) have led to a surge of autonomous GUI agents for smartphones, which complete tasks specified in natural language by predicting a sequence of API actions. Even though the task highly relies on past actions and visual observations, existing studies typically consider little of the semantic information carried by intermediate screenshots and screen operations. To address this, this work presents Chain-of-Action-Thought (dubbed CoAT), which takes into account the description of the previous actions, the current screen, and, more importantly, the action thinking of what actions should be performed and the outcomes of the chosen action. We demonstrate that, in a zero-shot setting upon an off-the-shelf LLM, CoAT significantly improves the goal progress compared to standard context modeling. To further facilitate research in this line, we construct a benchmark, Android-In-The-Zoo (AitZ), which contains 18,643 screen-action pairs together with chain-of-action-thought annotations. Experiments show that fine-tuning a 200M model on our AitZ dataset achieves performance on par with CogAgent-Chat-18B.
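To make the CoAT context concrete, below is a minimal sketch of how the per-step components named in the abstract (previous-action descriptions, current screen semantics, action thinking, and expected outcomes) could be composed into a prompt for the next action prediction. All identifiers and the prompt layout are illustrative assumptions, not the paper's actual interface.

```python
# Sketch of CoAT-style context construction for one GUI-agent step.
# The four fields mirror the components described in the abstract;
# names and formatting are hypothetical, for illustration only.
from dataclasses import dataclass
from typing import List

@dataclass
class CoATStep:
    screen_description: str   # semantics of the screenshot at this step
    action_think: str         # reasoning about which action to perform
    action: str               # the chosen action (e.g., an API call)
    expected_outcome: str     # the predicted result of taking the action

def build_coat_prompt(goal: str, history: List[CoATStep], screen: str) -> str:
    """Compose the chain-of-action-thought context for predicting the next action."""
    lines = [f"Goal: {goal}"]
    for i, step in enumerate(history, start=1):
        lines.append(
            f"Step {i}: screen={step.screen_description}; "
            f"think={step.action_think}; action={step.action}; "
            f"outcome={step.expected_outcome}"
        )
    lines.append(f"Current screen: {screen}")
    lines.append(
        "Think about which action to perform next and what outcome it leads to, "
        "then output the action."
    )
    return "\n".join(lines)
```

A standard-context baseline, by contrast, would include only the raw action history and the current screen, omitting the `action_think` and `expected_outcome` fields that CoAT adds.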