Recently, numerous new benchmarks have been established to evaluate the performance of large language models (LLMs), either by computing a holistic score or by employing another LLM as a judge. However, these approaches suffer from data leakage due to the open access of the benchmark data, as well as from inflexible evaluation processes. To address these issues, we introduce $\textbf{TreeEval}$, a benchmark-free evaluation method for LLMs that lets a high-performance LLM host an irreproducible evaluation session, thereby avoiding data leakage at its source. Specifically, this LLM acts as an examiner that raises a series of questions under a topic following a tree-planning strategy, which considers the current evaluation status to decide which question to generate next, ensuring the completeness and efficiency of the evaluation process. We evaluate $6$ models of different parameter sizes, including $7$B, $13$B, and $33$B, and achieve the highest correlation coefficient with AlpacaEval 2.0 using only around $45$ questions. We also conduct further analyses to demonstrate the robustness and reliability of TreeEval. Our code is available at https://github.com/Ashura5/TreeEval.