We introduce GEM (Generative Estimator for Mutual information), an evaluation metric for assessing language generation by Large Language Models (LLMs), particularly in generating informative judgments, without the need for a gold standard reference. GEM broadens the scenarios in which we can benchmark LLM generation performance: from traditional tasks, like machine translation and summarization, where gold standard references are readily available, to subjective tasks without clear gold standards, such as academic peer review. GEM uses a generative model to estimate the mutual information between candidate and reference responses, without requiring the reference to be a gold standard. In experiments on a human-annotated dataset, GEM achieves correlations with human scores that are competitive with the state-of-the-art GPT-4o Examiner, and it outperforms all other baselines. Additionally, GEM is more robust against strategic manipulations, such as rephrasing or elongation, which can artificially inflate scores under a GPT-4o Examiner. We also present GRE-bench (Generating Review Evaluation Benchmark), which evaluates LLMs by how well they generate high-quality peer reviews for academic research papers. Because GRE-bench is built on GEM, it inherits GEM's robustness properties. GRE-bench also circumvents data contamination (data leakage) by drawing on the continuous influx of new open-access research papers and peer reviews each year. We report GRE-bench results for various popular LLMs, assessing their peer-review capabilities on the ICLR 2023 dataset.
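The core idea of scoring a candidate by mutual information with a reference can be illustrated with a minimal pointwise sketch. This is a toy illustration, not the paper's actual estimator: the probability tables (`p_marginal`, `p_given_ref`) are hypothetical stand-ins for a generative model's log-probability estimates, and the responses are placeholder strings. A candidate scores highly when conditioning on the reference makes it much more likely than it is unconditionally.

```python
import math

# Hypothetical numbers standing in for a generative model's probability
# estimates; a real system would query a language model's log-probs.
p_marginal = {"good paper": 0.10, "random text": 0.10}
p_given_ref = {"good paper": 0.40, "random text": 0.10}  # conditioned on a reference review

def gem_score(candidate: str) -> float:
    """Pointwise-mutual-information-style score:
    log p(candidate | reference) - log p(candidate).
    Positive when the reference makes the candidate more likely."""
    return math.log(p_given_ref[candidate]) - math.log(p_marginal[candidate])

print(round(gem_score("good paper"), 3))   # → 1.386  (informative: conditioning helps)
print(round(gem_score("random text"), 3))  # → 0.0    (uninformative: reference adds nothing)
```

An elongated or rephrased but uninformative candidate leaves both probabilities essentially unchanged, so its score stays near zero, which is the intuition behind GEM's robustness to such manipulations.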