EntWorld: A Holistic Environment and Benchmark for Verifiable Enterprise GUI Agents

Recent advances in Multimodal Large Language Models (MLLMs) have enabled agents to operate in open-ended web and operating system environments. However, existing benchmarks predominantly target consumer-oriented scenarios (e.g., e-commerce and travel booking) and fail to capture the complexity and rigor of professional enterprise workflows. Enterprise systems pose distinct challenges, including high-density user interfaces, strict business logic constraints, and a strong reliance on precise, state-consistent information retrieval; these are settings in which current generalist agents often struggle. To address this gap, we introduce EntWorld, a large-scale benchmark consisting of 1,756 tasks across six representative enterprise domains, including customer relationship management (CRM), information technology infrastructure library (ITIL), and enterprise resource planning (ERP) systems. Unlike previous datasets that depend on fragile execution traces or extensive manual annotation, EntWorld adopts a schema-grounded task generation framework that directly reverse-engineers business logic from the underlying database schemas, enabling the synthesis of realistic, long-horizon workflows. Moreover, as part of dataset construction, we propose a SQL-based deterministic verification mechanism that replaces ambiguous visual matching with rigorous state-transition validation. Experimental results show that state-of-the-art models (e.g., GPT-4.1) achieve a success rate of 47.61% on EntWorld, substantially lower than human performance, highlighting a pronounced enterprise gap in current agentic capabilities and the need to develop domain-specific agents. We release EntWorld as a rigorous testbed to facilitate the development and evaluation of the next generation of enterprise-ready digital agents.
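To illustrate what SQL-based state-transition validation could look like in practice, the following is a minimal sketch and not EntWorld's actual implementation. It assumes a task is judged by running a checking query against the application database after the agent finishes and comparing the result with the expected rows. The `tickets` table, its columns, the helper `verify_state_transition`, and the direct `UPDATE` standing in for the agent's GUI actions are all hypothetical.

```python
import sqlite3


def verify_state_transition(conn, check_sql, expected_rows):
    """Return True if the checking query over the current database
    state yields exactly the expected rows (hypothetical helper)."""
    actual = conn.execute(check_sql).fetchall()
    return actual == expected_rows


# Hypothetical enterprise-style table; not taken from EntWorld.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tickets (id INTEGER PRIMARY KEY, status TEXT, assignee TEXT)"
)
conn.execute("INSERT INTO tickets (id, status, assignee) VALUES (1, 'open', NULL)")
conn.commit()

# Task (assumed): "Close ticket 1 and assign it to alice."
check_sql = "SELECT status, assignee FROM tickets WHERE id = 1"
expected = [("closed", "alice")]

# Before the task is performed, the check fails.
before = verify_state_transition(conn, check_sql, expected)

# Stand-in for the agent's GUI actions, which would normally reach the
# database through the application rather than through direct SQL.
conn.execute("UPDATE tickets SET status = 'closed', assignee = 'alice' WHERE id = 1")
conn.commit()

# After the task, the same deterministic check passes.
after = verify_state_transition(conn, check_sql, expected)
```

Because the check inspects the resulting database state rather than screenshots, the outcome does not depend on how the interface renders the record.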
