As Large Language Models transition into autonomous agents, user inputs frequently violate cooperative assumptions (e.g., implicit intent, missing parameters, false presuppositions, or ambiguous expressions), creating execution risks that text-only evaluations fail to capture. Existing benchmarks typically assume well-specified instructions or restrict evaluation to text-only, single-turn clarification, and thus do not measure multi-turn disambiguation under grounded execution risk. We introduce \textbf{Drift-Bench}, the first diagnostic benchmark that evaluates agentic pragmatics under input faults through multi-turn clarification across state-oriented and service-oriented execution environments. Grounded in classical theories of communication, \textbf{Drift-Bench} provides a unified taxonomy of cooperative breakdowns and employs a persona-driven user simulator together with the \textbf{Rise} evaluation protocol. Experiments show substantial performance drops under these faults, with clarification effectiveness varying across user personas and fault types. \textbf{Drift-Bench} bridges clarification research and agent safety evaluation, enabling systematic diagnosis of failures that can lead to unsafe executions.