CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments

Yi Yu, Guangquan Hu, Chenghuang Shen, Xingyan Liu, Jing Gu, Hangyi Sun, Junzhuo Ma, Weiting Liu, Jianfeng Liu, Mingyue Pu, Yu Wang, Zhengdong Xiao, Rui Xie, Longjiu Luo, Qianrong Wang, Gurong Cui, Honglin Qiao, Wenlian Lu

From arXiv; submitted to SIGKDD 2026

The increasing agentic capabilities of Large Language Models (LLMs) have enabled their deployment in real-world applications such as cloud services, where customer-assistant interactions exhibit high technical complexity and long-horizon dependencies, making robustness and resolution efficiency critical for customer satisfaction. However, existing benchmarks for LLM-based agents largely rely on synthetic environments that fail to capture the diversity and unpredictability of authentic customer inputs, and they often ignore the resolution efficiency essential for real-world deployment. To bridge this gap, we introduce CirrusBench, an evaluation framework built on real-world data from authentic cloud service tickets. CirrusBench preserves the intricate multi-turn logical chains and realistic tool dependencies inherent to technical service environments. Moving beyond execution correctness, we introduce Customer-Centric metrics to define agent success, including the Normalized Efficiency Index and Multi-Turn Latency, which explicitly measure resolution efficiency. Experiments with our framework reveal that while state-of-the-art models demonstrate strong reasoning capabilities, they frequently struggle in complex, realistic multi-turn tasks and fail to meet the high-efficiency standards required for customer service. These findings highlight critical directions for the future development of LLM-based agents in practical technical service applications. The CirrusBench evaluation framework is released at: https://github.com/CirrusAI
