Recently, there has been a growing trend of utilizing Large Language Models (LLMs) to evaluate the quality of other LLMs. Many studies employ proprietary closed-source models, especially GPT-4, as the evaluator. Alternatively, other works fine-tune judge models based on open-source LLMs as the evaluator. While the fine-tuned judge models are claimed to achieve evaluation capability comparable to GPT-4, in this work we conduct an empirical study of judge models. Our findings indicate that although the fine-tuned judge models achieve high performance on in-domain test sets, even surpassing GPT-4, they underperform GPT-4 across several dimensions, including generalizability, fairness, aspect-specific evaluation, and scalability. We further reveal that the fine-tuned judge model inherently operates as a task-specific classifier, which consequently imposes these limitations. Finally, we introduce an integrated method that leverages GPT-4 to compensate for these limitations and improve the fine-tuned judges. Experimental results show that our method achieves accuracy on par with GPT-4 at only 50% of the API expense.