Large language models (LLMs) have shown remarkable potential as human-like autonomous language agents that interact with real-world environments, especially for graphical user interface (GUI) automation. However, such GUI agents require comprehensive cognition, including exhaustive perception and reliable action response. We propose the \underline{Co}mprehensive \underline{Co}gnitive LLM \underline{Agent}, CoCo-Agent, with two novel approaches, comprehensive environment perception (CEP) and conditional action prediction (CAP), to systematically improve GUI automation performance. First, CEP facilitates GUI perception through different aspects and granularities, including screenshots and complementary detailed layouts for the visual channel and historical actions for the textual channel. Second, CAP decomposes action prediction into two sub-problems: action type prediction, and action target prediction conditioned on the action type. With this design, our agent achieves new state-of-the-art performance on the AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
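The two-step structure of CAP can be illustrated with a minimal sketch. It is not the authors' implementation: the action vocabulary in `TARGET_KINDS`, the `Action` container, and the two predictor callables (`predict_type`, `predict_target`) are all hypothetical names introduced here, with the callables standing in for LLM calls. The sketch only shows the control flow of predicting an action type first and then predicting a target conditioned on that type.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Action:
    """A predicted GUI action: a type plus an optional target."""
    action_type: str
    target: Optional[str] = None


# Hypothetical action vocabulary (an assumption, not taken from the paper).
# Each action type maps to the kind of target it needs, or None if it needs none.
TARGET_KINDS: Dict[str, Optional[str]] = {
    "click": "screen element",
    "input": "text to type",
    "scroll": "direction",
    "press_back": None,
}


def predict_action(
    context: str,
    predict_type: Callable[[str], str],
    predict_target: Callable[[str, str], str],
) -> Action:
    """Predict an action in two steps, the second conditioned on the first.

    `context` stands in for the perceived environment (for example layout
    text and action history). Both predictors are placeholders for model calls.
    """
    # Step 1: predict the action type from the context alone.
    action_type = predict_type(context)
    if action_type not in TARGET_KINDS:
        raise ValueError(f"unknown action type: {action_type!r}")

    # Some action types need no target, so the second step is skipped.
    if TARGET_KINDS[action_type] is None:
        return Action(action_type)

    # Step 2: predict the target, conditioned on the chosen action type.
    target = predict_target(context, action_type)
    return Action(action_type, target)


if __name__ == "__main__":
    # Stub predictors used purely for demonstration.
    action = predict_action(
        "layout: [Search button]; history: none",
        lambda ctx: "click",
        lambda ctx, kind: "Search button",
    )
    print(action)
```

Splitting the prediction this way means the target predictor only has to choose among targets that are valid for the selected action type, which is one plausible reading of why the decomposition helps.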