Customer Relationship Management (CRM) systems are vital for modern enterprises, providing a foundation for managing customer interactions and data. Integrating AI agents into CRM systems can automate routine processes and enhance personalized service. However, deploying and evaluating these agents is challenging due to the lack of realistic benchmarks that reflect the complexity of real-world CRM tasks. To address this issue, we introduce CRMArena, a novel benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments. Following guidance from CRM experts and industry best practices, we designed CRMArena with nine customer service tasks distributed across three personas: service agent, analyst, and manager. The benchmark includes 16 commonly used industrial objects (e.g., account, order, knowledge article, case) with high interconnectivity, along with latent variables (e.g., complaint habits, policy violations) to simulate realistic data distributions. Experimental results reveal that state-of-the-art LLM agents succeed in fewer than 40% of the tasks with ReAct prompting, and in fewer than 55% even with function-calling abilities. Our findings highlight the need for stronger function-calling and rule-following capabilities before agents can be deployed in real-world work environments. CRMArena is an open challenge to the community: systems that can reliably complete its tasks would demonstrate direct business value in a popular work environment.