Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

Pass$@k$ is widely used to report the reasoning performance of LLMs, but it often produces unstable and potentially misleading rankings, especially when the number of trials (samples) is limited and computational resources are constrained. We present a principled Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over $N$ trials (avg$@N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences between models. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass$@1$), which explains the empirical robustness of average accuracy while adding principled uncertainty estimates. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the posterior-based procedure achieves faster convergence and greater rank stability than Pass$@k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results support replacing Pass$@k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Source code is available at https://github.com/mohsenhariri/scorio
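As a minimal sketch of the closed-form quantities mentioned above, the following Python snippet computes the posterior mean and standard deviation of a weighted rubric score under a Dirichlet prior. It is not the scorio API: the function names, the averaging over questions under an independence assumption, and the normal-approximation interval with `z = 1.96` are all illustrative assumptions.

```python
import math
from typing import Optional, Sequence, Tuple


def dirichlet_score_posterior(
    counts: Sequence[Sequence[int]],
    weights: Sequence[float],
    prior: Optional[Sequence[float]] = None,
) -> Tuple[float, float]:
    """Posterior mean and standard deviation of a weighted rubric score.

    counts[q][k] is the number of trials on question q that fell in
    outcome category k; weights[k] is the rubric value of category k.
    The prior defaults to a uniform Dirichlet (all concentrations 1).
    Questions are assumed independent (an assumption of this sketch).
    """
    if prior is None:
        prior = [1.0] * len(weights)

    mean_sum = 0.0
    var_sum = 0.0
    for row in counts:
        # Dirichlet posterior concentrations: prior plus observed counts.
        alpha = [a + n for a, n in zip(prior, row)]
        total = sum(alpha)
        # First and second moments of the weights under the posterior mean.
        m1 = sum(w * a for w, a in zip(weights, alpha)) / total
        m2 = sum(w * w * a for w, a in zip(weights, alpha)) / total
        mean_sum += m1
        # Variance of a linear functional of a Dirichlet vector.
        var_sum += (m2 - m1 * m1) / (total + 1.0)

    n_questions = len(counts)
    return mean_sum / n_questions, math.sqrt(var_sum) / n_questions


def credible_interval(mean: float, sd: float, z: float = 1.96) -> Tuple[float, float]:
    """Approximate credible interval (normal approximation; an assumption)."""
    return mean - z * sd, mean + z * sd


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Binary case: categories are (fail, success) with rubric weights (0, 1).
mean, sd = dirichlet_score_posterior([[2, 8]], weights=[0.0, 1.0])
low, high = credible_interval(mean, sd)
```

In the binary example, 8 successes in 10 trials under the uniform prior give posterior concentrations (3, 9) and a posterior mean of 9/12 = 0.75. More generally, with $s$ successes in $N$ trials the per-question posterior mean is $(s+1)/(N+2)$, an increasing affine function of $s/N$ when $N$ is the same for every model, which is why the resulting ranking agrees with average accuracy.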
