We introduce SCUBA, a benchmark designed to evaluate computer-use agents on customer relationship management (CRM) workflows within the Salesforce platform. SCUBA contains 300 task instances derived from real user interviews, spanning three primary personas: platform administrators, sales representatives, and service agents. The tasks test a range of enterprise-critical abilities, including enterprise software UI navigation, data manipulation, workflow automation, information retrieval, and troubleshooting. To ensure realism, SCUBA operates in Salesforce sandbox environments, with support for parallel execution and fine-grained evaluation metrics that capture milestone progress. We benchmark a diverse set of agents under both zero-shot and demonstration-augmented settings. We observe large performance gaps across agent design paradigms and between open-source and closed-source models. In the zero-shot setting, computer-use agents powered by open-source models that perform strongly on related benchmarks such as OSWorld achieve success rates below 5% on SCUBA, while methods built on closed-source models reach task success rates of up to 39%. In the demonstration-augmented setting, task success rates improve to 50% while time and cost are reduced by 13% and 16%, respectively. These findings highlight both the challenges of enterprise task automation and the promise of agentic solutions. By offering a realistic benchmark with interpretable evaluation, SCUBA aims to accelerate progress in building reliable computer-use agents for complex business software ecosystems.