As Large Language Models (LLMs) have become more advanced, they have outpaced our ability to accurately evaluate their quality. Not only is it difficult to find data that adequately probes particular model properties, but evaluating the correctness of a model's free-form generation is itself a challenge. To address this, many evaluations now rely on LLMs themselves as judges to score the quality of outputs from other LLMs, most commonly a single large model such as GPT-4. While this method has grown in popularity, it is costly, has been shown to introduce intra-model bias, and, as we find in this work, very large models are often unnecessary. We propose instead to evaluate models using a Panel of LLm evaluators (PoLL). Across three distinct judge settings and six different datasets, we find that a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias because it draws on disjoint model families, and is over seven times less expensive.
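The abstract does not spell out how a panel's individual judgments are combined into a single verdict. As a minimal illustrative sketch (not the paper's exact implementation), two natural aggregation choices are majority voting over categorical verdicts and average pooling over numeric scores; the judge names, labels, and rating scale below are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical verdicts from a panel of three small judge models;
# the model names and the correct/incorrect labels are illustrative.
panel_verdicts = {
    "judge-a": "correct",
    "judge-b": "correct",
    "judge-c": "incorrect",
}

def max_vote(verdicts: dict[str, str]) -> str:
    """Resolve a panel's categorical verdicts by majority (max) voting."""
    return Counter(verdicts.values()).most_common(1)[0][0]

def average_pool(scores: dict[str, float]) -> float:
    """Pool numeric judge scores (e.g., ratings on a 1-5 scale) by averaging."""
    return mean(scores.values())

print(max_vote(panel_verdicts))                                   # -> "correct"
print(average_pool({"judge-a": 4, "judge-b": 5, "judge-c": 3}))   # -> 4.0
```

Because each judge in the panel comes from a different model family, pooling of this kind dilutes any single model's preference for outputs resembling its own generations, which is the intra-model bias the panel is meant to reduce.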