Most research on natural language generation (NLG) relies on evaluation benchmarks that provide only a limited number of references per sample, which may result in poor correlation with human judgements. The underlying reason is that a single meaning can be expressed in many different forms, so evaluation against one or a few references may not accurately reflect the quality of a model's hypotheses. To address this issue, this paper presents a novel method, named Para-Ref, that enhances existing evaluation benchmarks by increasing the number of references. We leverage large language models (LLMs) to paraphrase a single reference into multiple high-quality references with diverse expressions. Experimental results on the representative NLG tasks of machine translation, text summarization, and image captioning demonstrate that our method effectively improves the correlation with human evaluation for sixteen automatic evaluation metrics, by +7.82% in relative terms. We release the code and data at https://github.com/RUCAIBox/Para-Ref.
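To make the idea of evaluating against an enriched reference set concrete, the following is a minimal sketch and not the released Para-Ref code. It assumes a toy unigram-F1 metric (not one of the sixteen metrics studied), hand-written paraphrases standing in for LLM output, and max aggregation over references; the abstract does not state how scores are aggregated, so that choice is an assumption. The names `unigram_f1` and `multi_reference_score` are hypothetical.

```python
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy lexical-overlap metric: F1 over lowercased whitespace tokens."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(hypothesis: str, references: list[str]) -> float:
    """Score against every reference and keep the best match.

    Max aggregation is an assumption made for this sketch.
    """
    if not references:
        raise ValueError("at least one reference is required")
    return max(unigram_f1(hypothesis, ref) for ref in references)


original_reference = "the cat sat on the mat"
# Hand-written stand-ins for LLM paraphrases (assumption, not real model output).
paraphrases = [
    "a cat was sitting on the rug",
    "the cat is sitting on the rug",
]
hypothesis = "the cat is sitting on the rug"

single = multi_reference_score(hypothesis, [original_reference])
enriched = multi_reference_score(hypothesis, [original_reference] + paraphrases)
print(f"single reference: {single:.3f}")
print(f"with paraphrases: {enriched:.3f}")
```

In this toy case the hypothesis is a valid rewording that the single reference penalizes, while the enriched set contains a closer match. Whether that translates into better correlation with human judgements is the empirical question the paper addresses.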