LLM-as-a-Judge is a widely used method for evaluating the performance of Large Language Models (LLMs) across a variety of tasks. We address the challenge of quantifying the uncertainty of LLM-as-a-Judge evaluations. While uncertainty quantification has been well studied in other domains, applying it effectively to LLMs poses unique challenges due to their complex decision-making capabilities and computational demands. In this paper, we introduce a novel uncertainty-quantification method designed to enhance the trustworthiness of LLM-as-a-Judge evaluations. The method works by analyzing the relationships between the generated assessments and the possible ratings: by cross-evaluating these relationships and constructing a confusion matrix from token probabilities, it derives labels of high or low uncertainty. We evaluate the method across multiple benchmarks and demonstrate a strong correlation between the accuracy of LLM evaluations and the derived uncertainty scores. Our findings suggest that this method can significantly improve the reliability and consistency of LLM-as-a-Judge evaluations.
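To make the cross-evaluation idea concrete, the following is a minimal sketch of how a high/low uncertainty label could be derived from token probabilities. The abstract does not specify the exact construction, so the 5-point rating scale, the row-normalized cross-evaluation matrix, the entropy-based score, and the 0.5 threshold are all illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

# Assumed setting: a judge model scores on a 5-point scale, and for each
# candidate rating it is conditioned on, we can read off the token-level
# probability distribution over the rating tokens "1".."5".
RATINGS = [1, 2, 3, 4, 5]

def cross_evaluation_matrix(token_probs: list[np.ndarray]) -> np.ndarray:
    """Stack one distribution over rating tokens per conditioning rating.

    token_probs[i][j] ~ P(rating j | assessment conditioned on rating i),
    giving a confusion-matrix-like array of shape (n_ratings, n_ratings).
    """
    m = np.vstack(token_probs)
    return m / m.sum(axis=1, keepdims=True)  # renormalize each row

def uncertainty_label(matrix: np.ndarray, threshold: float = 0.5) -> str:
    """Label an evaluation as 'high' or 'low' uncertainty.

    Uses normalized Shannon entropy of the row-averaged distribution as a
    stand-in uncertainty score; the 0.5 cutoff is an illustrative choice.
    """
    avg = matrix.mean(axis=0)
    entropy = -np.sum(avg * np.log(avg + 1e-12))
    max_entropy = np.log(len(avg))
    return "high" if entropy / max_entropy > threshold else "low"

# Usage: a judge that concentrates mass on rating 4 regardless of the
# conditioning rating yields a low-entropy matrix, hence low uncertainty.
probs = [np.array([0.02, 0.03, 0.10, 0.80, 0.05]) for _ in RATINGS]
print(uncertainty_label(cross_evaluation_matrix(probs)))  # -> "low"
```

A judge whose rating-token mass shifts depending on the conditioning rating would instead produce a diffuse matrix and be flagged as high uncertainty, which is the behavior the derived labels are meant to capture.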