Recent advancements in reference-free learned metrics for open-domain dialogue evaluation have been driven by progress in pre-trained language models and the availability of dialogue data with high-quality human annotations. However, current studies concentrate predominantly on English dialogues, and the generalization of these metrics to other languages has not been fully examined, largely because no multilingual dialogue evaluation benchmark exists. To address this issue, we introduce xDial-Eval, built on top of open-source English dialogue evaluation datasets. xDial-Eval includes 12 turn-level and 6 dialogue-level English datasets, comprising 14930 annotated turns and 8691 annotated dialogues, respectively. The English dialogue data are extended to nine other languages with commercial machine translation systems. On xDial-Eval, we conduct comprehensive analyses of previous BERT-based metrics and recently emerged large language models. Lastly, we establish strong self-supervised and multilingual baselines. In terms of average Pearson correlation over all datasets and languages, the best baseline outperforms OpenAI's ChatGPT by absolute improvements of 6.5% and 4.6% at the turn and dialogue levels, respectively, despite having far fewer parameters. The data and code are publicly available at https://github.com/e0397123/xDial-Eval.
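The headline comparison is stated as an average Pearson correlation over all datasets and languages. As a minimal sketch of how such an average could be computed, the Python below correlates metric scores with human scores for each (dataset, language) pair and takes the unweighted mean. The function names, the input layout, and the toy scores are assumptions made for illustration; they are not taken from the xDial-Eval code, and the benchmark's own aggregation may differ (for example, in how datasets are weighted).

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def average_correlation(results):
    """Unweighted mean of per-(dataset, language) Pearson correlations.

    `results` maps a hypothetical (dataset, language) key to a pair
    (metric_scores, human_scores). This layout is an assumption.
    """
    return mean(pearson(metric, human) for metric, human in results.values())


# Toy, made-up scores purely to exercise the functions.
toy_results = {
    ("dataset_a", "en"): ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
    ("dataset_a", "zh"): ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]),
}
avg = average_correlation(toy_results)
```

Under this reading, an absolute improvement of 6.5% would correspond to a difference of 0.065 between two such averaged correlation values, one per evaluated metric.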