Automatic evaluation metrics for generated texts play an important role in the NLG field, especially with the rapid growth of LLMs. However, existing metrics are often limited to specific scenarios, making it difficult to meet the evaluation requirements of the expanding range of LLM applications. There is therefore a demand for new, flexible, and effective metrics. In this study, we introduce RepEval, the first metric that leverages the projection of LLM representations for evaluation. RepEval requires only a minimal number of sample pairs for training, and through simple prompt modifications it can be easily adapted to various tasks. Results on ten datasets from three tasks demonstrate the effectiveness of our method, which exhibits stronger correlations with human judgments than previous metrics and even outperforms GPT-4. Our work underscores the richness of information about text quality embedded within LLM representations, offering insights for the development of new metrics.
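To make the core idea concrete, below is a minimal illustrative sketch of scoring texts via a projection of LLM representations. It is not the authors' implementation of RepEval: the model name ("gpt2"), the prompt template, the use of the final-layer last-token hidden state, and the simple difference-of-means projection direction fitted from a few (better, worse) sample pairs are all assumptions made for illustration.

```python
# Illustrative sketch only: evaluate text quality by projecting LLM hidden
# states onto a direction learned from a few (better, worse) sample pairs.
# Model, prompt, layer choice, and projection method are assumptions, not
# the RepEval paper's actual configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # small stand-in model; RepEval would use a larger LLM

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# Hypothetical prompt template; the task can be switched by editing this string.
PROMPT = "Evaluate the quality of the following text:\n{text}\nQuality:"


def representation(text: str) -> torch.Tensor:
    """Return the final-layer hidden state of the last prompt token."""
    inputs = tokenizer(PROMPT.format(text=text), return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[-1][0, -1, :]  # shape: (hidden_size,)


def fit_direction(pairs: list[tuple[str, str]]) -> torch.Tensor:
    """Fit a projection direction from a few (better, worse) sample pairs.
    Here we simply take the normalized difference of mean representations."""
    better = torch.stack([representation(b) for b, _ in pairs])
    worse = torch.stack([representation(w) for _, w in pairs])
    direction = better.mean(dim=0) - worse.mean(dim=0)
    return direction / direction.norm()


def score(text: str, direction: torch.Tensor) -> float:
    """Score a text by projecting its representation onto the learned direction."""
    return float(representation(text) @ direction)


if __name__ == "__main__":
    # Toy sample pairs (better, worse); only a handful is needed in this sketch.
    pairs = [
        ("The cat sat calmly on the warm windowsill.",
         "cat the sat sill window warm on"),
        ("She carefully explained the result step by step.",
         "explain she result the step careful"),
    ]
    direction = fit_direction(pairs)
    print(score("A clear, fluent summary of the findings.", direction))
    print(score("findings summary fluent of clear a the.", direction))
```

In this sketch, adapting to a new task only requires changing the prompt template and supplying a few task-specific sample pairs, mirroring the flexibility described in the abstract.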