Block-based programming environments such as Scratch play a central role in low-code education, yet the ability of AI agents to construct programs through Graphical User Interfaces (GUIs) remains underexplored. We introduce ScratchWorld, a benchmark for evaluating multimodal GUI agents on program-construction tasks in Scratch. Grounded in the Use-Modify-Create pedagogical framework, ScratchWorld comprises 83 curated tasks spanning four problem categories: Create, Debug, Extend, and Compute. To rigorously diagnose the sources of agent failure, the benchmark employs two complementary interaction modes: primitive mode requires fine-grained drag-and-drop manipulation to directly assess visuomotor control, while composite mode exposes high-level semantic APIs that disentangle program reasoning from GUI execution. To ensure reliable assessment, we propose an execution-based evaluation protocol that validates the functional correctness of constructed Scratch programs through runtime tests within the browser environment. Extensive experiments across state-of-the-art multimodal language models and GUI agents reveal a substantial reasoning–acting gap, highlighting persistent challenges in fine-grained GUI manipulation despite strong planning capabilities.
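The core idea of execution-based evaluation, checking the observable runtime behavior of a constructed program rather than its block-for-block structure, can be sketched as follows. This is a toy illustration: the miniature interpreter, the `run_script`/`evaluate` helpers, and the opcode names are hypothetical stand-ins for the benchmark's actual in-browser Scratch VM tests.

```python
# Hypothetical sketch of an execution-based check for an agent-built
# Scratch program. ScratchWorld runs such tests inside the browser;
# here a toy interpreter stands in for the Scratch VM.

def run_script(script, state):
    """Execute a linear list of (opcode, arg) blocks against sprite state."""
    for opcode, arg in script:
        if opcode == "move":      # move <arg> steps along x
            state["x"] += arg
        elif opcode == "turn":    # rotate clockwise by <arg> degrees
            state["dir"] = (state["dir"] + arg) % 360
        elif opcode == "say":     # record spoken text
            state["said"].append(arg)
    return state

def evaluate(program, expected):
    """Runtime test: execute the program and compare observable
    behavior to the task's expected outcome."""
    state = {"x": 0, "dir": 90, "said": []}
    final = run_script(program, state)
    return all(final[key] == value for key, value in expected.items())

# An agent-constructed script and the task's expected final behavior.
agent_program = [("move", 10), ("turn", 90), ("say", "done")]
print(evaluate(agent_program, {"x": 10, "dir": 180, "said": ["done"]}))  # True
```

Because the check compares runtime state rather than block layout, any program that produces the required behavior passes, regardless of how the agent arranged the blocks.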