Despite recent advances, evaluating how well large language models (LLMs) follow user instructions remains an open problem. Although prompt-based approaches to evaluating language models have become increasingly common, little work has examined whether these methods are correct. In this work, we perform a meta-evaluation of a variety of metrics to quantify how accurately they measure the instruction-following abilities of LLMs. We conduct our investigation on grounded query-based summarization, for which we collect riSum, a new short-form, real-world dataset containing 300 document-instruction pairs with 3 answers each. All 900 answers are rated by 3 human annotators. Using riSum, we analyze the agreement between evaluation methods and human judgment. Finally, we propose new LLM-based reference-free evaluation methods that improve upon established baselines and perform on par with costly reference-based metrics that require high-quality summaries.
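As a rough illustration of what measuring agreement between an evaluation method and human judgment can look like, the sketch below averages the annotator ratings for each answer and computes a rank correlation against each metric's scores. The choice of Kendall's tau-a, the function names `kendall_tau_a` and `meta_evaluate`, the rating scale, and the toy data are all assumptions made for illustration; they are not taken from the paper and are not riSum data.

```python
from itertools import combinations
from statistics import mean


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def meta_evaluate(human_ratings, metric_scores):
    """Score each metric by its rank agreement with mean human ratings.

    human_ratings: one list of annotator ratings per answer (hypothetical scale).
    metric_scores: dict mapping a metric name to one score per answer.
    """
    human_mean = [mean(ratings) for ratings in human_ratings]
    return {
        name: kendall_tau_a(scores, human_mean)
        for name, scores in metric_scores.items()
    }


# Toy data (invented): 4 answers, each rated by 3 annotators.
human_ratings = [[5, 4, 5], [2, 3, 2], [4, 4, 3], [1, 2, 1]]
metric_scores = {
    "metric_a": [0.9, 0.4, 0.7, 0.1],  # ranks answers exactly as humans do
    "metric_b": [0.2, 0.8, 0.5, 0.6],  # mostly disagrees with humans
}

agreement = meta_evaluate(human_ratings, metric_scores)
for name, tau in agreement.items():
    print(f"{name}: tau = {tau:.3f}")
```

In this toy setup, a metric whose ranking of answers matches the averaged human ranking receives a tau of 1, and a metric that mostly inverts it receives a negative value. Other agreement measures, such as Spearman correlation or pairwise accuracy, would fit the same structure.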