Automatically evaluating the quality of language generation is critical. Although recent learned metrics show high correlation with human judgement, these metrics cannot explain their verdicts or associate their scores with defects in the generated text. To address this limitation, we present InstructScore, an explainable evaluation metric for text generation. By harnessing both explicit human instruction and the implicit knowledge of GPT-4, we fine-tune a text evaluation metric based on LLaMA that produces both a score for the generated text and a human-readable diagnostic report. We evaluate InstructScore on a variety of generation tasks, including translation, captioning, data-to-text, and commonsense generation. Experiments show that our 7B model surpasses all other unsupervised metrics, including those based on 175B GPT-3 and GPT-4. Surprisingly, InstructScore, even without direct supervision from human-rated data, performs on par with state-of-the-art metrics such as COMET22, which are fine-tuned on human ratings.