The Graphical User Interface (GUI) is pivotal for human interaction with the digital world, enabling efficient device control and the completion of complex tasks. Recent progress in Large Language Models (LLMs) and Vision Language Models (VLMs) offers the chance to create advanced GUI agents. To ensure their effectiveness, there is a pressing need for qualified benchmarks that provide trustworthy and reproducible evaluations, a need that current benchmarks often fail to meet. To tackle this issue, we introduce Mobile-Env, a comprehensive toolkit tailored for creating GUI benchmarks in the Android mobile environment. Mobile-Env offers an isolated and controllable setting for reliable evaluations, and accommodates intermediate instructions and rewards to reflect real-world usage more naturally. Using Mobile-Env, we collect an open-world task set across various real-world apps and a fixed-world task set, WikiHow, which captures a large amount of dynamic online content for fully controllable and reproducible evaluation. We conduct comprehensive evaluations of LLM agents on these benchmarks. Our findings reveal that even advanced models (e.g., GPT-4V and LLaMA-3) struggle with tasks that are relatively simple for humans. This highlights a crucial gap in current models and underscores the importance of developing more capable foundation models and more effective GUI agent frameworks.