Large language models (LLMs) have revolutionized the field of NLP. Notably, their in-context learning capabilities also enable their use as evaluation metrics for natural language generation, making them particularly advantageous in low-resource scenarios and time-restricted applications. In this work, we introduce PrExMe, a large-scale prompt exploration for metrics, in which we evaluate more than 720 prompt templates for open-source LLM-based metrics on machine translation (MT) and summarization datasets, totaling over 6.6M evaluations. This extensive comparison (1) serves as a benchmark of the performance of recent open-source LLMs as metrics and (2) explores the stability and variability of different prompting strategies. We discover that, on the one hand, there are scenarios in which prompts are stable. For instance, some LLMs show idiosyncratic preferences and favor grading generated texts with textual labels, while others prefer to return numeric scores. On the other hand, the stability of prompts and model rankings can be susceptible to seemingly innocuous changes. For example, changing the requested output format from "0 to 100" to "-1 to +1" can strongly affect the rankings in our evaluation. Our study contributes to understanding the impact of different prompting approaches on LLM-based metrics for MT and summarization evaluation, highlighting the most stable prompting patterns and potential limitations.