Despite significant research effort in the development of automatic dialogue evaluation metrics, little thought has been given to evaluating dialogues in languages other than English. At the same time, ensuring that metrics are invariant to semantically similar responses is also an overlooked topic. To achieve the desired properties of robustness and multilinguality for dialogue evaluation metrics, we propose a novel framework that combines the strengths of current evaluation models with the newly established paradigm of prompting Large Language Models (LLMs). Empirical results show that our framework achieves state-of-the-art results in terms of mean Spearman correlation scores across several benchmarks and ranks first in both the Robust and Multilingual tasks of the DSTC11 Track 4 "Automatic Evaluation Metrics for Open-Domain Dialogue Systems", demonstrating the evaluation capabilities of prompted LLMs.
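For readers unfamiliar with the headline measure, the sketch below illustrates one plausible reading of "mean Spearman correlation across several benchmarks": compute the Spearman rank correlation between metric scores and human ratings on each benchmark, then take the unweighted average. The abstract does not specify the aggregation, so the unweighted mean is an assumption, and the function names, benchmark names, and scores are hypothetical illustrations rather than data or code from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores, human_scores):
    """Spearman correlation is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))


def mean_spearman(benchmarks):
    """Unweighted mean of per-benchmark Spearman correlations (an assumption)."""
    return mean([spearman(m, h) for m, h in benchmarks.values()])


# Hypothetical toy data: (metric scores, human ratings) per benchmark.
toy_benchmarks = {
    "bench_a": ([0.10, 0.40, 0.35, 0.80], [1, 3, 2, 5]),
    "bench_b": ([0.90, 0.20, 0.50], [1, 3, 2]),
}
overall = mean_spearman(toy_benchmarks)
```

Because the correlation is computed on ranks, a metric is rewarded for ordering responses the way human annotators do, regardless of the scale of its raw scores.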