In this paper, we begin by showing that, when answering a query, more capable Large Language Models (LLMs) exhibit a more even probability distribution over their answers than less capable ones. Building on this observation, we propose ProbDiff, a new self-evaluation method for assessing the capability of different LLMs. The approach requires neither an additional evaluation model nor reliance on external proprietary models such as GPT-4 as the judge. Instead, it uses the LLMs under test themselves to compute the probability discrepancy between an initial response and its revised versions: for a given query, the LLM exhibiting the larger discrepancy is judged the weaker of the two. Our experiments show that ProbDiff yields results comparable to GPT-4-based evaluation across a range of scenarios, including natural language generation (NLG) tasks such as translation, summarization, and our proposed Xiaohongshu blog-writing task, as well as LLM evaluation benchmarks such as AlignBench, MT-Bench, and AlpacaEval, for LLMs of varying sizes.
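To make the core idea concrete, the sketch below illustrates one way the probability discrepancy could be computed with an off-the-shelf causal LLM. It is a minimal illustration, not the authors' implementation: the model name ("gpt2"), the use of length-normalised log-likelihood as the score, and the simple initial-vs-revised comparison are all assumptions made here for clarity.

```python
# Hypothetical sketch of the probability-discrepancy idea: score an initial
# answer and its revision under the same model, and take the drop in
# (length-normalised) log-likelihood as the discrepancy for that query.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder model; the paper evaluates much larger LLMs

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def avg_log_likelihood(query: str, response: str) -> float:
    """Length-normalised log-likelihood of `response` conditioned on `query`."""
    query_ids = tokenizer(query, return_tensors="pt").input_ids
    full_ids = tokenizer(query + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                      # (1, L, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)    # predicts tokens 1..L-1
    targets = full_ids[0, 1:]
    token_ll = log_probs[torch.arange(targets.shape[0]), targets]
    # Approximate boundary between query and response tokens (ignores BPE merging
    # effects at the junction, which is acceptable for a sketch).
    resp_start = query_ids.shape[1]
    return token_ll[resp_start - 1:].mean().item()


def probability_discrepancy(query: str, initial: str, revised: str) -> float:
    """Drop in normalised log-likelihood from the initial to the revised answer."""
    return avg_log_likelihood(query, initial) - avg_log_likelihood(query, revised)
```

In use, each LLM under comparison would generate and revise its own answer to the query, score both under itself with a function like the one above, and the model showing the larger discrepancy on that query would be treated as the weaker of the two.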