This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings. Following a PRISMA-inspired protocol, we systematically reviewed nearly 250 scholarly sources drawn from a wide range of publication venues, capturing the state of the art and grounding our analysis. We structure the field through two interrelated taxonomies: one that defines \emph{what to evaluate} and another that explains \emph{how to evaluate}. The first taxonomy identifies the key components of LLM-based agents for multi-turn conversation and their evaluation dimensions, including task completion, response quality, user experience, memory and context retention, and planning and tool integration. Together, these dimensions support a holistic and meaningful assessment of conversational agent performance. The second taxonomy focuses on evaluation methodology, categorizing approaches into annotation-based evaluation, automated metrics, hybrid strategies that combine human judgments with quantitative measures, and self-judging methods in which LLMs act as evaluators. This framework captures not only traditional reference-based metrics such as BLEU and ROUGE, but also advanced techniques that reflect the dynamic, interactive nature of multi-turn dialogue.
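To make the automated-metric category concrete, the following Python sketch computes turn-level BLEU and ROUGE-L scores for a single agent response against a reference answer. It is an illustrative example only, not a method proposed in the surveyed work: the function name \texttt{turn\_level\_metrics}, the per-turn framing, and the choice of the \texttt{nltk} and \texttt{rouge-score} packages are our own assumptions.

\begin{verbatim}
# Illustrative sketch: reference-based automated metrics (BLEU, ROUGE-L)
# for a single agent turn in a multi-turn dialogue.
# Assumes: pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer


def turn_level_metrics(reference: str, response: str) -> dict:
    """Score one agent response against one reference answer."""
    # Sentence-level BLEU with smoothing to avoid zero scores on short turns.
    bleu = sentence_bleu(
        [reference.split()], response.split(),
        smoothing_function=SmoothingFunction().method1,
    )
    # ROUGE-L F1 measures longest-common-subsequence overlap.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l_f1 = scorer.score(reference, response)["rougeL"].fmeasure
    return {"bleu": bleu, "rougeL_f1": rouge_l_f1}


if __name__ == "__main__":
    ref = "Your flight to Boston departs at 9:15 am from gate B12."
    hyp = "The Boston flight leaves at 9:15 am, gate B12."
    print(turn_level_metrics(ref, hyp))
\end{verbatim}

Such surface-overlap scores are cheap to compute per turn, but on their own they do not capture the interactive qualities of multi-turn dialogue, which is why the second taxonomy also covers human, hybrid, and LLM-based judging strategies.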