In this paper, we first show that, when answering queries, more capable Large Language Models (LLMs) produce a more even probability distribution over their answers than less capable ones. Building on this observation, we propose ProbDiff, a new self-evaluation method for assessing the capability of LLMs. The method requires neither an additional evaluation model nor an external, proprietary judge such as GPT-4. Instead, it uses the LLM under test to compute the probability discrepancy between its initial response and revised versions of that response. When two LLMs are compared on the same query, the one with the higher discrepancy is judged to be relatively weaker. Our experiments show that ProbDiff achieves results on par with GPT-4-based evaluation across a range of scenarios: natural language generation (NLG) tasks such as translation, summarization, and our proposed Xiaohongshu blog writing task, and LLM evaluation benchmarks such as AlignBench, MT-Bench, and AlpacaEval, for LLMs of varying sizes.
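The comparison described above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the sketch makes several assumptions: the discrepancy is taken to be the length-normalized (mean) token log-probability of the revised response minus that of the initial response, both scored by the model under test; the function names `mean_log_prob`, `prob_discrepancy`, and `weaker_model` are hypothetical; and the log-probability values are invented for illustration only.

```python
from statistics import mean


def mean_log_prob(token_log_probs):
    """Length-normalized log-probability of one response.

    Assumption: the caller has already scored each token of the response
    with the model under test and passes the per-token log-probabilities.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return mean(token_log_probs)


def prob_discrepancy(initial_log_probs, revised_log_probs):
    """Assumed discrepancy: revised minus initial mean log-probability.

    The signed, length-normalized form is an illustrative choice, not
    necessarily the definition used in the paper.
    """
    return mean_log_prob(revised_log_probs) - mean_log_prob(initial_log_probs)


def weaker_model(discrepancies):
    """Return the model name with the highest discrepancy on a query.

    Following the abstract, a higher discrepancy on the same query is
    read as relatively weaker capability.
    """
    return max(discrepancies, key=discrepancies.get)


# Invented per-token log-probabilities for one query (illustration only).
scores = {
    "model_a": prob_discrepancy([-1.0, -1.0], [-0.5, -0.5]),
    "model_b": prob_discrepancy([-0.5, -0.5], [-0.25, -0.25]),
}
print(weaker_model(scores))  # prints "model_a"
```

In practice, the per-token log-probabilities would come from scoring each response with the model being evaluated, and the per-query comparisons would be aggregated over a full benchmark rather than read off a single query.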