Despite recent advances, evaluating how well large language models (LLMs) follow user instructions remains an open problem. While prompt-based approaches to evaluating language models have become increasingly common, little work has examined how correct these methods are. In this work, we perform a meta-evaluation of a variety of metrics to quantify how accurately they measure the instruction-following abilities of LLMs. We conduct our investigation on grounded query-based summarization, for which we collect riSum, a new short-form, real-world dataset containing $300$ document-instruction pairs with $3$ answers each. All $900$ answers are rated by $3$ human annotators. Using riSum, we analyze the agreement between evaluation methods and human judgment. Finally, we propose new LLM-based reference-free evaluation methods that improve upon established baselines and perform on par with costly reference-based metrics, which require high-quality summaries.