INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback

Automatically evaluating the quality of language generation is critical. Although recent learned metrics show high correlation with human judgement, these metrics cannot explain their verdict or associate the scores with defects in the generated text. To address this limitation, we present InstructScore, an explainable evaluation metric for text generation. By harnessing both explicit human instruction and the implicit knowledge of GPT-4, we fine-tune a text evaluation metric based on LLaMA, producing both a score for generated text and a human-readable diagnostic report. We evaluate InstructScore on a variety of generation tasks, including translation, captioning, data-to-text, and commonsense generation. Experiments show that our 7B model surpasses all other unsupervised metrics, including those based on 175B GPT-3 and GPT-4. Surprisingly, InstructScore, even without direct supervision from human-rated data, achieves performance on par with state-of-the-art metrics like COMET22, which were fine-tuned on human ratings.
