Large language models (LLMs) have revolutionized NLP research. Notably, in-context learning enables their use as evaluation metrics for natural language generation, making them particularly advantageous in low-resource scenarios and time-restricted applications. In this work, we introduce PrExMe, a large-scale Prompt Exploration for Metrics, in which we evaluate more than 720 prompt templates for open-source LLM-based metrics on machine translation (MT) and summarization datasets, totaling over 6.6M evaluations. This extensive comparison (1) benchmarks recent open-source LLMs as metrics and (2) explores the stability and variability of different prompting strategies. We discover that, on the one hand, there are scenarios in which prompts are stable. For instance, some LLMs show idiosyncratic preferences and favor grading generated texts with textual labels, while others prefer to return numeric scores. On the other hand, the stability of prompts and model rankings can be susceptible to seemingly innocuous changes. For example, changing the requested output format from "0 to 100" to "-1 to +1" can strongly affect the rankings in our evaluation. Our study contributes to understanding the impact of different prompting approaches on LLM-based metrics for MT and summarization evaluation, highlighting the most stable prompting patterns and potential limitations.
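To illustrate the kind of prompt-template variation the abstract describes, the sketch below pairs a single evaluation instruction with several requested output formats (numeric scales vs. textual labels). This is a minimal, hypothetical example: the template text, format names, and `build_prompts` helper are illustrative assumptions, not the actual templates used in PrExMe.

```python
# Illustrative sketch of prompt-template variation for LLM-based evaluation.
# The base instruction and output-format wordings below are hypothetical,
# not the actual PrExMe templates.
BASE_TEMPLATE = (
    "Evaluate the quality of the following translation.\n"
    "Source: {src}\n"
    "Translation: {hyp}\n"
)

# One template per requested output format; small changes here (e.g.
# "0 to 100" vs. "-1 to +1") are the kind of variation studied.
OUTPUT_FORMATS = {
    "score_0_100": "Respond with a score from 0 to 100.",
    "score_-1_+1": "Respond with a score from -1 to +1.",
    "labels": "Respond with one label: bad, neutral, or good.",
}

def build_prompts(src: str, hyp: str) -> dict[str, str]:
    """Return one full prompt per output format (names are illustrative)."""
    filled = BASE_TEMPLATE.format(src=src, hyp=hyp)
    return {name: filled + instruction for name, instruction in OUTPUT_FORMATS.items()}

prompts = build_prompts("Guten Morgen", "Good morning")
```

Crossing such format choices with base instructions, in-context examples, and models is what drives the template count into the hundreds.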