As Large Language Models (LLMs) have become more advanced, they have outpaced our ability to accurately evaluate their quality. Not only is it difficult to find data that adequately probes particular model properties, but evaluating the correctness of a model's free-form generation is a challenge in its own right. To address this, many evaluations now rely on using LLMs themselves as judges to score the quality of outputs from other LLMs. Evaluations most commonly use a single large model such as GPT-4. While this method has grown in popularity, it is costly and has been shown to introduce intra-model bias, and in this work we find that very large models are often unnecessary. We propose instead to evaluate models using a Panel of LLM evaluators (PoLL). Across three distinct judge settings and spanning six different datasets, we find that using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.
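To make the panel idea concrete, the following is a minimal sketch of how verdicts from several judges might be pooled into one decision. The abstract does not specify the aggregation rule, so the strict majority vote used here is an assumption, as are all names (`Judge`, `poll_majority_vote`, and the three stand-in judges). In practice each judge would wrap a call to a model from a different family; simple string heuristics stand in for those calls so that the example runs offline.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge signature: (question, answer, reference) -> verdict.
# A real judge would prompt an LLM; these are offline stand-ins.
Judge = Callable[[str, str, str], bool]


def exact_judge(question: str, answer: str, reference: str) -> bool:
    """Stand-in judge: accepts only an exact match (ignoring outer whitespace)."""
    return answer.strip() == reference.strip()


def substring_judge(question: str, answer: str, reference: str) -> bool:
    """Stand-in judge: accepts if the reference appears anywhere in the answer."""
    return reference.lower() in answer.lower()


def token_judge(question: str, answer: str, reference: str) -> bool:
    """Stand-in judge: accepts if every reference token occurs in the answer."""
    return set(reference.lower().split()) <= set(answer.lower().split())


def poll_majority_vote(
    judges: Sequence[Judge], question: str, answer: str, reference: str
) -> bool:
    """Pool panel verdicts by strict majority (assumed rule; ties count as False)."""
    votes = Counter(judge(question, answer, reference) for judge in judges)
    return votes[True] > votes[False]


panel = [exact_judge, substring_judge, token_judge]
accepted = poll_majority_vote(panel, "Capital of France?", "It is Paris", "Paris")
rejected = poll_majority_vote(panel, "Capital of France?", "It is Lyon", "Paris")
```

Here the strictest stand-in judge rejects "It is Paris" but is outvoted by the other two, which illustrates how a panel can dilute the idiosyncrasies of any single judge. An odd panel size avoids ties under this rule.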