Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Liya Zhu, Jingzhe Ding, Jian Zhang, Jianbo Xue, Shihao Liang, Ge Zhang, Yi Zhu, Duju Zeng, Xiang Gao, Qingshui Gu, Mailun Gao, Huimin Che, Yan Zhao, Peiheng Zhou, Haojun Wang, Chaobo Xian, Lili Le, Chi Wu, Yiwei Liu, Shengda Long, Jiale Yang, Fangzhi Xu, Sijin Wu, Haodong Duan, Chao He, Zhaojian Li, Minchao Wang, Huan Zhou, Jiani Hou, Chuqian Yu, Weiran Shi, Hongwan Gao, Jiamin Chen, Guanhong Chen, Tingqin Luo, Kaiyuan Zhang, Zhixin Yao, Qing Hua, Yuhao Jiang, Jin Chen, Pu Chen, Zhenyu Hu, Xingyu Li, Zhengxuan Jiang, Meng Cao, Tianfeng Long, Haozhe Wang, Mingzhang Wang, Yichen Zhang, Yiming Dai, Chenchen Zhang, Jiaying Wang, Xinying Liu, Xingzu Liu, Lingling Zhang, Xinjie Chen, Yujia Qin, Wangchunshu Zhou, Zhiyong Wu, Yang Liu, Jiaheng Liu, Lei Zhang, Shen Yan, Wenhao Huang, Zaiyuan Wang, Xiaolong Chang

AI agents have rapidly evolved toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces (GUIs) to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still focus predominantly on general-purpose software, relatively simple applications, and short-horizon tasks, so it remains largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work end to end. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve success rates only slightly above 30%, showing that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain consistency across long-horizon workflows, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide insight into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.
