Recently, there has been a growing trend of using Large Language Models (LLMs) to evaluate the quality of other LLMs. Many studies have employed proprietary closed-source models, especially GPT-4, as the evaluator. Other works have instead fine-tuned open-source LLMs to serve as judge models. These fine-tuned judge models are claimed to achieve evaluation capability comparable to that of GPT-4. In this work, we test that claim with an empirical study of judge models. Our findings indicate that although the fine-tuned judge models achieve high performance on in-domain test sets, even surpassing GPT-4, they underperform GPT-4 across several dimensions, including generalizability, fairness, aspect-specific evaluation, and scalability. We also reveal that a fine-tuned judge model inherently operates as a task-specific classifier, which gives rise to these limitations. Finally, we propose an effective indicator to measure the reliability of fine-tuned judges, with the aim of maximizing their utility in LLM evaluation.