Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents that interact with real-world environments, especially for graphical user interface (GUI) automation. However, such GUI agents require comprehensive cognitive abilities, including exhaustive perception and reliable action response. We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches, comprehensive environment perception (CEP) and conditional action prediction (CAP), to systematically improve GUI automation performance. First, CEP facilitates GUI perception across different aspects and granularities, including screenshots and complementary detailed layouts for the visual channel and historical actions for the textual channel. Second, CAP decomposes action prediction into two sub-problems: action type prediction, and action target prediction conditioned on the action type. With this design, our agent achieves new state-of-the-art performance on the AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios. Code is available at https://github.com/xbmxb/CoCo-Agent.
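To make the two ideas concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that layout elements are serialized into text alongside the action history (the screenshot itself would go to the model's visual encoder and is not shown), and that the model emits an action as `type | target`. The action vocabulary, the `LayoutItem` class, the prompt wording, and the output format are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical action vocabulary: maps each action type to the kind of
# target it expects (None means the action takes no target).
TARGET_KINDS = {
    "click": "element index",
    "type": "text",
    "scroll": "direction",
    "press_back": None,
}


@dataclass
class LayoutItem:
    """One on-screen element from a layout description (assumed format)."""
    label: str
    bbox: Tuple[int, int, int, int]


def build_prompt(goal: str, layout: List[LayoutItem], history: List[str]) -> str:
    """CEP-style textual input: goal, fine-grained layout, and past actions."""
    layout_lines = [
        f"[{i}] {item.label} @ {item.bbox}" for i, item in enumerate(layout)
    ]
    history_lines = history or ["(none)"]
    return "\n".join(
        ["Goal: " + goal, "Layout:", *layout_lines,
         "Previous actions:", *history_lines]
    )


def parse_action(output: str) -> Tuple[str, Optional[str]]:
    """CAP-style decoding: read the action type first, then interpret the
    target conditioned on that type."""
    action_type, _, rest = output.partition("|")
    action_type = action_type.strip()
    if action_type not in TARGET_KINDS:
        raise ValueError(f"unknown action type: {action_type!r}")
    if TARGET_KINDS[action_type] is None:
        return action_type, None  # this type has no target to predict
    target = rest.strip()
    if not target:
        raise ValueError(f"{action_type!r} requires a {TARGET_KINDS[action_type]}")
    return action_type, target


prompt = build_prompt(
    "open settings",
    [LayoutItem("Settings", (10, 20, 110, 60))],
    ["scroll | down"],
)
action = parse_action("click | 0")
```

Here `parse_action` only considers a target once the action type is known, which mirrors the conditioning described in the abstract: a targetless type such as `press_back` skips the second sub-problem entirely.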