Goal changes are a defining feature of real-world multi-turn interactions, yet current agent benchmarks primarily evaluate static objectives or one-shot tool use. We introduce AgentChangeBench, a benchmark explicitly designed to measure how tool-augmented language model agents adapt to mid-dialogue goal shifts across three enterprise domains. Our framework formalizes evaluation through four complementary metrics: Task Success Rate (TSR) for effectiveness, Tool Use Efficiency (TUE) for reliability, Tool Call Redundancy Rate (TCRR) for wasted effort, and Goal-Shift Recovery Time (GSRT) for adaptation latency. AgentChangeBench comprises 2,835 task sequences and five user personas, each persona designed to trigger realistic shift points in ongoing workflows. Using this setup, we evaluate several frontier models and uncover sharp contrasts obscured by traditional $\text{pass}@k$ scores: for example, GPT-4o reaches $92.2\%$ recovery on airline booking shifts while Gemini collapses to $48.6\%$, and retail tasks show near-perfect parameter validity yet redundancy rates above $80\%$, revealing major inefficiencies. These findings demonstrate that high raw accuracy does not imply robustness under dynamic goals, and that explicit measurement of recovery time and redundancy is essential. AgentChangeBench establishes a reproducible testbed for diagnosing and improving agent resilience in realistic enterprise settings.
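To make the redundancy and recovery metrics concrete, the following is a minimal illustrative sketch, not the paper's official formulas: it assumes TCRR is the fraction of tool calls that exactly repeat an earlier call, and GSRT is the number of turns between a goal shift and the first tool call aligned with the new goal. All function names, the toy transcript, and both formulas are assumptions for illustration.

```python
# Toy versions of two AgentChangeBench-style metrics (assumed definitions,
# not the benchmark's exact formulas).

def tcrr(tool_calls):
    """Tool Call Redundancy Rate: share of calls duplicating a prior call."""
    seen, redundant = set(), 0
    for call in tool_calls:  # call = (tool_name, hashable args)
        if call in seen:
            redundant += 1
        seen.add(call)
    return redundant / len(tool_calls) if tool_calls else 0.0

def gsrt(turns, shift_turn, new_goal_tools):
    """Goal-Shift Recovery Time: turns from the shift until some call
    matches a tool associated with the new goal; None if never recovered."""
    for delay, tools in enumerate(turns[shift_turn:]):
        if any(t in new_goal_tools for t in tools):
            return delay
    return None

# Hypothetical transcript: one repeated search, then a booking.
calls = [("search_flights", ("JFK", "LAX")),
         ("search_flights", ("JFK", "LAX")),  # redundant repeat
         ("book_flight", ("AA100",))]
print(tcrr(calls))  # 1 of 3 calls is a repeat -> 0.333...

# Goal shifts at turn 1 (e.g. user now wants a cancellation).
turns = [["search_flights"], [], ["get_booking"], ["cancel_booking"]]
print(gsrt(turns, shift_turn=1, new_goal_tools={"cancel_booking"}))  # 2
```

In practice the benchmark would need a richer notion of call equivalence (e.g. semantically identical arguments) and of goal alignment, but the sketch shows why high parameter validity can coexist with a high redundancy rate: every repeated call above is perfectly well-formed.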