Customer-service LLM agents increasingly make policy-bound decisions (refunds, rebooking, billing disputes), but the same ``helpful'' interaction style can be exploited: a small fraction of users can induce unauthorized concessions, shifting costs to others and eroding trust in agentic workflows. We present a cross-domain benchmark of profit-seeking direct prompt injection in customer-service interactions, spanning 10 service domains and 100 realistic attack scripts grouped into five technique families. We evaluate five widely used models under a unified rubric with uncertainty reporting and find that attack success is highly domain-dependent (airline support is the most exploitable) and technique-dependent (payload splitting is the most consistently effective). We release data and evaluation code to support reproducible auditing and to inform the design of oversight and recovery workflows for trustworthy, human-centered agent interfaces.