Despite significant research effort in the development of automatic dialogue evaluation metrics, little attention has been given to evaluating dialogues in languages other than English. Ensuring that metrics are invariant to semantically similar responses is likewise an overlooked topic. To achieve the desired properties of robustness and multilinguality for dialogue evaluation metrics, we propose a novel framework that combines the strengths of current evaluation models with the newly established paradigm of prompting Large Language Models (LLMs). Empirical results show that our framework achieves state-of-the-art results in terms of mean Spearman correlation scores across several benchmarks and ranks first on both the Robust and Multilingual tasks of the DSTC11 Track 4 "Automatic Evaluation Metrics for Open-Domain Dialogue Systems", demonstrating the evaluation capabilities of prompted LLMs.