Recent progress in GUI agents has substantially improved visual grounding, yet robust planning remains challenging, particularly when the environment deviates from a canonical initial state. In real applications, users often invoke assistance mid-workflow, where software may be partially configured, steps may have been executed in different orders, or the interface may differ from its default setup. Such task-state variability is pervasive but insufficiently evaluated in existing GUI benchmarks. To address this gap, we introduce WorldGUI, a benchmark covering ten widely used desktop and web applications with tasks instantiated under diverse, systematically constructed initial states. These variations capture realistic human-computer interaction settings and enable diagnostic evaluation of an agent's ability to recover, adapt plans, and handle non-default contexts. We further present WorldGUI-Agent, a simple and model-agnostic framework that organizes planning and execution around three critique stages, improving reliability in dynamic environments. Experiments demonstrate that state-of-the-art GUI agents exhibit substantial performance degradation under non-default initial conditions, revealing limited robustness and fragile planning behaviors. Our benchmark and framework provide a foundation for developing more adaptable and reliable GUI agents. The code and data are available at https://github.com/showlab/WorldGUI.