MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments

Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant one due to its reproducible environment and deterministic evaluation. However, recent agents achieve success rates above 90% on it, which indicates saturation and motivates a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. We introduce MobileWorld, a substantially more challenging benchmark designed to reflect real-world usage through 201 tasks across 20 applications. MobileWorld derives its difficulty from an emphasis on long-horizon, cross-application workflows: compared with AndroidWorld, its tasks require nearly twice as many completion steps on average (27.8 vs. 14.3), and a far higher proportion of them span multiple apps (62.2% vs. 9.5%). To overcome the limitations of existing environments, MobileWorld balances production-grade utility with reproducible evaluation by using open-source alternatives to industry-standard applications (e.g., Mattermost in place of Slack). This approach yields a fully observable and controlled environment, because the application source code can be modified and the backend database can be queried directly for precise verification. MobileWorld also introduces novel task categories, including agent-user interaction and Model Context Protocol (MCP)-augmented tasks, for evaluating agents in user-aware, hybrid-tool scenarios. To facilitate evaluation, we develop a planner-executor agentic framework with an extended action space that supports user interactions and MCP calls. Our results reveal a sharp performance drop compared to AndroidWorld, with the best agentic framework and the best end-to-end model achieving success rates of 51.7% and 20.9%, respectively, highlighting ample headroom for future research.
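
As a rough illustration of what verification through direct backend database access could look like, the following minimal sketch checks a task's outcome by querying application state rather than inspecting the screen. It is not MobileWorld's actual code: the `posts` table, its columns, and the helper names `make_backend`, `agent_posts_message`, and `verify_message_sent` are hypothetical, and an in-memory SQLite database stands in for whatever database the real application uses.

```python
import sqlite3


def make_backend() -> sqlite3.Connection:
    """Create an in-memory stand-in for an app backend (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts ("
        "id INTEGER PRIMARY KEY, channel TEXT, author TEXT, message TEXT)"
    )
    return conn


def agent_posts_message(
    conn: sqlite3.Connection, channel: str, author: str, message: str
) -> None:
    """Simulate the side effect of an agent completing a 'send message' task."""
    conn.execute(
        "INSERT INTO posts (channel, author, message) VALUES (?, ?, ?)",
        (channel, author, message),
    )
    conn.commit()


def verify_message_sent(
    conn: sqlite3.Connection, channel: str, author: str, expected: str
) -> bool:
    """Deterministic check: does the backend hold the expected message?"""
    row = conn.execute(
        "SELECT COUNT(*) FROM posts "
        "WHERE channel = ? AND author = ? AND message = ?",
        (channel, author, expected),
    ).fetchone()
    return row[0] > 0


if __name__ == "__main__":
    backend = make_backend()
    # Before the agent acts, the check fails.
    print(verify_message_sent(backend, "general", "agent", "Meeting at 3pm"))
    agent_posts_message(backend, "general", "agent", "Meeting at 3pm")
    # After the agent acts, the check passes.
    print(verify_message_sent(backend, "general", "agent", "Meeting at 3pm"))
```

Under these assumptions, the check depends only on stored state, so the same final state always produces the same verdict regardless of how the agent navigated the interface.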
