Traditional customer support systems, such as Interactive Voice Response (IVR), rely on rigid scripts and lack the flexibility required for handling complex, policy-driven tasks. While large language model (LLM) agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real-world support workflows remains an open challenge. Existing benchmarks primarily focus on tool usage or task completion, overlooking an agent's capacity to adhere to multi-step policies, navigate task dependencies, and remain robust to unpredictable user or environment behavior. In this work, we introduce JourneyBench, a benchmark designed to assess policy-aware agents in customer support. JourneyBench leverages graph representations to generate diverse, realistic support scenarios and proposes the User Journey Coverage Score, a novel metric to measure policy adherence. We evaluate multiple state-of-the-art LLMs using two agent designs: a Static-Prompt Agent (SPA) and a Dynamic-Prompt Agent (DPA) that explicitly models policy control. Across 703 conversations in three domains, we show that DPA significantly boosts policy adherence, even allowing smaller models like GPT-4o-mini to outperform more capable ones like GPT-4o. Our findings demonstrate the importance of structured orchestration and establish JourneyBench as a critical resource to advance AI-driven customer support beyond IVR-era limitations.