Large language models (LLMs) have demonstrated significant utility in real-world applications, exhibiting impressive capabilities in natural language processing and understanding. Benchmark evaluations are crucial for assessing LLM capabilities, as they provide a comprehensive view of a model's strengths and weaknesses. However, current evaluation methods often overlook the inherent randomness of LLMs by employing deterministic generation strategies or relying on a single random sample, resulting in unaccounted sampling variance and unreliable benchmark score estimates. In this paper, we propose a hierarchical statistical model that provides a more comprehensive representation of the benchmarking process by incorporating both benchmark characteristics and LLM randomness. We show that leveraging multiple generations improves the accuracy of the estimated benchmark score and reduces variance. We also introduce $\mathbb P\left(\text{correct}\right)$, a prompt-level difficulty score based on correct ratios, providing fine-grained insights into individual prompts. Additionally, we create a data map that visualizes prompt difficulty alongside prompt semantics, enabling error detection and quality control in benchmark construction.
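To make the estimation idea concrete, below is a minimal sketch, not the paper's implementation: it assumes a simple Beta-Bernoulli hierarchy (a latent per-prompt success probability standing in for benchmark characteristics, with Bernoulli draws standing in for LLM sampling randomness) and computes $\mathbb P\left(\text{correct}\right)$ as the correct ratio over $k$ generations. All names, parameters, and the simulated data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_correct(is_correct: np.ndarray) -> float:
    """Prompt-level difficulty score: correct ratio over k generations."""
    return float(np.mean(is_correct))

# Hypothetical benchmark: each prompt i has a latent success probability p_i
# (benchmark characteristic); each generation is a Bernoulli draw from p_i
# (LLM sampling randomness). All numbers here are simulated, not real results.
n_prompts, k = 200, 10
latent_p = rng.beta(2.0, 2.0, size=n_prompts)
samples = rng.random((n_prompts, k)) < latent_p[:, None]  # bool, shape (n_prompts, k)

# Benchmark score from a single generation per prompt vs. the mean of k generations.
single_generation_score = samples[:, 0].mean()
multi_generation_score = samples.mean(axis=1).mean()

# Per-prompt P(correct) gives the fine-grained difficulty axis of a data map.
prompt_difficulty = np.array([p_correct(row) for row in samples])
hardest = np.argsort(prompt_difficulty)[:5]

print(f"single-generation score: {single_generation_score:.3f}")
print(f"{k}-generation score:    {multi_generation_score:.3f}")
print(f"five hardest prompt indices: {hardest}")
```

Averaging over $k$ independent generations shrinks the sampling variance of each prompt's score from $p_i(1-p_i)$ to $p_i(1-p_i)/k$, which is the intuition behind the variance reduction claimed above.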