N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely used across a range of natural language generation (NLG) tasks. However, recent studies have shown that these matching-based metrics correlate only weakly with human evaluations, especially when compared with neural-based metrics such as BLEURT. In this paper, we conjecture that the performance bottleneck of matching-based metrics may be caused by the limited diversity of references. To address this issue, we propose to use \textit{multiple references} to improve the consistency between these metrics and human evaluations. On the WMT Metrics benchmarks, we observe that multi-reference F200spBLEU surpasses its conventional single-reference counterpart by 7.2\% in accuracy. Remarkably, it also exceeds the neural-based BERTScore by 3.9\% in accuracy. Moreover, we observe that our multi-reference metric can mitigate, to a large extent, the data leakage issue in large language models (LLMs). We release the code and data at \url{https://github.com/SefaZeng/LLM-Ref}.
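To illustrate why additional references can help a matching-based metric, the following is a minimal sketch of BLEU-style clipped n-gram precision computed against one or several references. It is not the F200spBLEU implementation evaluated in this work: the function names (`ngram_counts`, `clipped_precision`), the whitespace tokenization, and the example sentences are all assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, references, n=1):
    """BLEU-style modified n-gram precision against one or more references.

    Each hypothesis n-gram count is clipped by the maximum count of that
    n-gram in any single reference. Whitespace tokenization is an assumption
    of this sketch, not the tokenizer used in the paper.
    """
    hyp_counts = ngram_counts(hypothesis.split(), n)
    if not hyp_counts:
        return 0.0

    # For every n-gram, keep the largest count seen in any one reference.
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in ngram_counts(reference.split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


# Hypothetical example sentences.
hypothesis = "the cat sat on the mat"
reference_a = "a cat was sitting on the mat"
reference_b = "the cat sat on a rug"

single = clipped_precision(hypothesis, [reference_a])
multi = clipped_precision(hypothesis, [reference_a, reference_b])
print(f"single-reference precision: {single:.3f}")
print(f"multi-reference precision:  {multi:.3f}")
```

In this toy case, the second reference licenses the word "sat", which the first reference phrases differently, so a valid hypothesis is penalized less when both references are available.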