The rise of large language models (LLMs) has led to a surge of autonomous GUI agents for smartphones, which complete tasks triggered by natural language by predicting a sequence of API actions. Even though such tasks highly rely on past actions and visual observations, existing studies typically pay little attention to the semantic information carried by intermediate screenshots and screen operations. To address this, this work presents Chain-of-Action-Thought (dubbed CoAT), which takes into account the description of previous actions, the current screen, and, more importantly, the reasoning about which actions should be performed and the outcomes of the chosen action. We demonstrate that, in a zero-shot setting on three off-the-shelf LMMs, CoAT significantly improves action prediction compared to previously proposed context modeling methods. To further facilitate research along this line, we construct a dataset, Android-In-The-Zoo (AitZ), which contains 18,643 screen-action pairs together with chain-of-action-thought annotations. Experiments show that fine-tuning a 1B model (i.e., AUTO-UI-base) on our AitZ dataset achieves performance on par with CogAgent-Chat-18B.
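To make the CoAT formulation concrete, the following is a minimal Python sketch of how one screen-action step might bundle the four pieces of context named above (previous actions, current screen, action thinking, and action outcome) into a prompt for an off-the-shelf LMM. All field names, the prompt layout, and the tap(...) action string are illustrative assumptions, not the paper's exact schema.

```python
# A minimal sketch of a CoAT-style step, assuming a hypothetical schema;
# field names and prompt layout are illustrative, not the paper's own.
from dataclasses import dataclass
from typing import List


@dataclass
class CoATStep:
    """One screen-action step with chain-of-action-thought context."""
    screen_description: str      # semantic summary of the current screenshot
    previous_actions: List[str]  # natural-language history of past actions
    action_thinking: str         # reasoning about which action to take next
    next_action: str             # the chosen action (e.g., an API call)
    action_result: str           # expected outcome of the chosen action

    def to_prompt(self) -> str:
        """Render the step as a text prompt for an off-the-shelf LMM."""
        history = "\n".join(f"- {a}" for a in self.previous_actions) or "- (none)"
        return (
            f"Current screen: {self.screen_description}\n"
            f"Previous actions:\n{history}\n"
            f"Thought: {self.action_thinking}\n"
            f"Next action: {self.next_action}\n"
            f"Expected result: {self.action_result}\n"
        )


# Hypothetical usage for a single step of a settings-change task.
step = CoATStep(
    screen_description="Settings app, main page with a search bar on top",
    previous_actions=["Opened the Settings app"],
    action_thinking="To change Wi-Fi, tap the 'Network & internet' entry.",
    next_action="tap('Network & internet')",
    action_result="The Network & internet sub-page is shown.",
)
print(step.to_prompt())
```

In this reading, the key difference from plain action-history prompting is the two extra fields: the explicit "Thought" before the action and the "Expected result" after it, which is what the chain-of-action-thought annotations in AitZ supervise.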