Evaluating LLM-based agents remains challenging because identifying meaningful failure cases often requires substantial human effort to design realistic test scenarios. Prior work primarily focuses on automatically discovering agent failures induced by adversarial users, while overlooking queries that reflect genuine user intent yet still trigger agent failures. We introduce PQR, a framework that generates queries which not only surface agent failures with respect to specific objectives (e.g., helpfulness or safety) but also resemble real users' intents. PQR operates through an iterative interaction between two complementary modules. The query refinement module rewrites queries to explore diverse variations, while the prompt refinement module uses prior feedback to derive new objective-violating strategies and realism policies for refining prompts, which in turn generate failure-triggering yet realistic queries. We evaluate PQR on detecting unhelpful responses from an e-commerce QA agent. Our method uncovers 23% to 78% more unhelpful responses than previous methods, and the queries it generates are more diverse and realistic.
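To make the interaction between the two modules concrete, the following is a minimal sketch of one way such a loop could be organized. It is not the authors' implementation: every name here (`Prompt`, `run_loop`, and the `toy_*` functions) is hypothetical, and the toy functions are deterministic stand-ins for what would in practice be LLM calls, the agent under test, and a judge of failure and realism.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Prompt:
    """Hypothetical container for what the prompt refinement module maintains."""
    strategies: List[str] = field(default_factory=list)  # objective-violating strategies
    policies: List[str] = field(default_factory=list)    # realism policies


# (query, triggered_failure, looks_realistic); an assumed feedback format
Feedback = Tuple[str, bool, bool]


def run_loop(
    prompt: Prompt,
    generate: Callable[[Prompt], List[str]],
    rewrite: Callable[[str], List[str]],
    evaluate: Callable[[str], Feedback],
    refine: Callable[[Prompt, List[Feedback]], Prompt],
    rounds: int = 3,
) -> Tuple[List[str], Prompt]:
    """Alternate query refinement and prompt refinement for a fixed number of rounds."""
    failures: List[str] = []
    for _ in range(rounds):
        # The current prompt generates candidate queries.
        queries = generate(prompt)
        # Query refinement: add rewrites to explore variations.
        queries = queries + [v for q in queries for v in rewrite(q)]
        # Run each query against the agent and judge the outcome.
        feedback = [evaluate(q) for q in queries]
        for query, failed, realistic in feedback:
            if failed and realistic and query not in failures:
                failures.append(query)
        # Prompt refinement: use the feedback to update strategies and policies.
        prompt = refine(prompt, feedback)
    return failures, prompt


# Toy stand-ins (assumptions for illustration only).
def toy_generate(prompt: Prompt) -> List[str]:
    base = "Does this blender fit under a standard cabinet?"
    return [base] + [f"{base} ({s})" for s in prompt.strategies]


def toy_rewrite(query: str) -> List[str]:
    return [query.replace("blender", "mixer")]


def toy_evaluate(query: str) -> Feedback:
    failed = "mixer" in query      # pretend the agent is unhelpful about mixers
    realistic = len(query) < 120   # pretend short queries look realistic
    return (query, failed, realistic)


def toy_refine(prompt: Prompt, feedback: List[Feedback]) -> Prompt:
    strategies = list(prompt.strategies)
    new_strategy = "ask about dimensions"
    if any(failed for _, failed, _ in feedback) and new_strategy not in strategies:
        strategies.append(new_strategy)
    return Prompt(strategies, list(prompt.policies))


failures, final_prompt = run_loop(
    Prompt(), toy_generate, toy_rewrite, toy_evaluate, toy_refine, rounds=2
)
```

Passing the four components in as callables keeps the loop independent of any particular model or judge, so the same skeleton could be reused for a different objective by swapping `evaluate` and `refine`.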