This review gives an extensive overview of evaluation methods for task-oriented dialogue systems, paying special attention to practical applications such as customer service. The review (1) provides an overview of the constructs and metrics used in previous work, (2) discusses challenges in dialogue system evaluation, and (3) develops a research agenda for the future of dialogue system evaluation. We conducted a systematic review of four databases (ACL, ACM, IEEE and Web of Science), which after screening resulted in 122 studies. These studies were carefully analysed for the constructs and methods they proposed for evaluation. We found a wide variety in both constructs and methods. The operationalisation in particular is not always clearly reported. Newer developments concerning large language models are discussed in two contexts: powering dialogue systems and supporting the evaluation process. We hope that future work will take a more critical approach to the operationalisation and specification of the constructs used. To work towards this aim, this review ends with recommendations for evaluation and suggestions for outstanding questions.