Evaluating multi-turn interactive agents is challenging due to the need for human assessment. Evaluation with simulated users has been introduced as an alternative; however, existing approaches typically model generic users and overlook the domain-specific principles required to capture realistic behavior. We propose SAGE, a novel user Simulation framework for multi-turn AGent Evaluation that integrates knowledge from business contexts. SAGE incorporates top-down knowledge rooted in business logic, such as ideal customer profiles, grounding user behavior in realistic customer personas. It further integrates bottom-up knowledge drawn from business agent infrastructure (e.g., product catalogs, FAQs, and knowledge bases), allowing the simulator to generate interactions that reflect users' information needs and expectations in a company's target market. Through empirical evaluation, we find that this approach produces interactions that are more realistic and diverse, while also identifying up to 33% more agent errors, highlighting its effectiveness as an evaluation tool for bug-finding and iterative agent improvement.