Since the natural language processing (NLP) community began using large language models (LLMs) as critics to evaluate the quality of generated text, most existing work has trained critique generation models on evaluation data labeled by directly prompting GPT-4. We observe that these models lack the ability to generate informative critiques in both pointwise grading and pairwise comparison, especially without references. As a result, their critiques cannot distinguish generated texts at a fine-grained level, which leads to unsatisfactory evaluation performance. In this paper, we propose a simple yet effective method called Eval-Instruct, which first acquires pointwise grading critiques with pseudo references and then revises these critiques via multi-path prompting to obtain informative evaluation data across tasks and settings, including pointwise grading and pairwise comparison, each with and without references. After fine-tuning on these data, the resulting model, CritiqueLLM, is empirically shown to outperform ChatGPT and all open-source baselines, and even to achieve evaluation performance comparable to GPT-4 in system-level correlations of pointwise grading. We also demonstrate that the generated critiques can act as scalable feedback to further improve the generation quality of strong LLMs such as ChatGPT.