Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework

Aman Gupta, Kevin Rossell, Edesio Alcobaça, Jose Chrystian Lima Pacheco, Carolina Baptista de Lima, Shao Tang, Luiz Paulo Rabachini, Luis Moneda, Herbert Fei, Daniel Silva, Rohan Ramanath

From arXiv, 12 pages. Accepted to KDD '26 (32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining).

The rapid rise in LLM capabilities has made AI agents increasingly viable across a broad range of tasks. Among the most promising applications is building production-ready customer-facing agents, a challenge that demands coordinated excellence in evaluation methodology, context engineering, training, and online measurement. Yet these critical pillars are typically developed in isolation, creating blind spots that only surface after deployment. In this paper, we present a unified framework that bridges offline development with online impact for customer support AI agents at Nubank, a company with 100M+ users. Our approach integrates several key components: (1) structured context engineering tailored to customer support agents, (2) systematic human-in-the-loop prompt iteration, (3) rigorous LLM judge evaluation with measured inter-rater agreement and GEPA optimization for consistency, and (4) ideation-to-production validation. A central insight is that evaluation-pipeline quality directly determines iteration velocity. We present results from five production deployments spanning distinct domains: card delivery, debt management, credit-limit support, card management, and product explanation. These deployments deliver consistent customer-satisfaction gains while substantially accelerating iteration. In our card-delivery deployment, large-scale A/B testing yields a 37 percentage-point improvement in AI transactional Net Promoter Score and a 29 percentage-point gain in self-service rate over prior agent variants, alongside a strong correlation between offline simulation metrics and online outcomes, demonstrating that eval-driven development reliably predicts production impact. In most use cases, AI satisfaction comes within a few percentage points of that of expert human agents.
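
The abstract mentions LLM judge evaluation "with measured inter-rater agreement" but does not say which agreement statistic is used. As a minimal sketch, assuming Cohen's kappa between a human annotator and an LLM judge on the same set of conversations, agreement could be measured as follows. The function name `cohens_kappa` and the `human` and `judge` label lists are hypothetical and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    Assumption: this statistic is chosen for illustration only; the
    paper's abstract does not name its agreement measure.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over categories of the product of each
    # rater's marginal frequency for that category.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts on eight conversations (not data from the paper).
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]

kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.75"
```

Here the raters agree on 7 of 8 items (0.875) while chance agreement is 0.5, so kappa is 0.75. Correcting for chance matters when one verdict dominates, because raw agreement can then look high even if the judge adds little signal.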

