Prometheus: Inducing Fine-grained Evaluation Capability in Language Models

Recently, using a powerful proprietary Large Language Model (LLM) (e.g., GPT-4) as an evaluator for long-form responses has become the de facto standard. However, for practitioners with large-scale evaluation tasks and custom criteria in mind (e.g., child-readability), using a proprietary LLM as an evaluator is unreliable because of its closed-source nature, uncontrolled versioning, and prohibitive costs. In this work, we propose Prometheus, a fully open-source LLM that is on par with GPT-4's evaluation capabilities when accompanied by the appropriate reference materials (reference answer, score rubric). We first construct the Feedback Collection, a new dataset that consists of 1K fine-grained score rubrics, 20K instructions, and 100K responses and language feedback generated by GPT-4. Using the Feedback Collection, we train Prometheus, a 13B evaluator LLM that can assess any given long-form text based on a customized score rubric provided by the user. Experimental results show that Prometheus achieves a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics, which is on par with GPT-4 (0.882) and greatly outperforms ChatGPT (0.392). Furthermore, measuring correlation with GPT-4 using 1222 customized score rubrics across four benchmarks (MT Bench, Vicuna Bench, Feedback Bench, Flask Eval) shows similar trends, bolstering Prometheus's capability as an evaluator LLM. Lastly, Prometheus achieves the highest accuracy on two human preference benchmarks (HHH Alignment & MT Bench Human Judgment) compared to open-source reward models explicitly trained on human preference datasets, highlighting its potential as a universal reward model. We open-source our code, dataset, and model at https://github.com/kaistAI/Prometheus.
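
The headline results above are Pearson correlations between an evaluator's scores and human scores for the same responses. As a minimal sketch of how such a number can be computed, the snippet below implements the standard Pearson formula in plain Python. The function name `pearson` and the score lists are illustrative assumptions; they are not taken from the Prometheus codebase or its data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical 1-5 scores for six responses (made-up values).
    human_scores = [1, 2, 3, 4, 5, 3]
    evaluator_scores = [1, 2, 4, 4, 5, 2]
    print(f"Pearson r = {pearson(human_scores, evaluator_scores):.3f}")
```

A value near 1 means the evaluator ranks and spaces responses much as the human annotators do; a value near 0 means its scores carry little linear information about the human scores.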
