Pass$@k$ is widely used to report LLM reasoning performance, but it often yields unstable, misleading rankings, especially when the number of trials (samples) is limited and compute is constrained. We present a principled Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over $N$ trials (avg$@N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass$@1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the Bayesian/avg procedure achieves faster convergence and greater rank stability than Pass$@k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass$@k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Code is available at https://github.com/mohsenhariri/scorio.
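To make the abstract's protocol concrete, here is a minimal sketch of its binary (0/1) special case: with a Beta$(\alpha,\beta)$ prior ($\alpha=\beta=1$ is the uniform prior mentioned above), $s$ successes in $N$ trials give the posterior Beta$(\alpha+s,\ \beta+N-s)$, whose mean $(\alpha+s)/(\alpha+\beta+N)$ is closed-form. The function name, the Monte Carlo credible interval (using only the standard library's `random.betavariate`), and the example counts are illustrative assumptions, not taken from the paper's released code.

```python
import random

def posterior_summary(successes, trials, alpha=1.0, beta=1.0,
                      level=0.95, draws=20000, seed=0):
    """Beta-Binomial posterior for a model's underlying success probability.

    Posterior is Beta(alpha + successes, beta + trials - successes).
    Returns the closed-form posterior mean and an equal-tailed credible
    interval estimated by sorting draws from the posterior.
    """
    a = alpha + successes
    b = beta + trials - successes
    mean = a / (a + b)  # closed-form posterior mean
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo = samples[int((1 - level) / 2 * draws)]
    hi = samples[int((1 + level) / 2 * draws) - 1]
    return mean, (lo, hi)

# Hypothetical comparison of two models on the same 24-trial benchmark.
# The abstract's decision rule: non-overlapping credible intervals mark
# a meaningful gap; overlapping intervals mean the gap may be noise.
mean_a, ci_a = posterior_summary(18, 24)  # posterior mean = 19/26
mean_b, ci_b = posterior_summary(12, 24)  # posterior mean = 13/26
meaningful = ci_a[0] > ci_b[1] or ci_b[0] > ci_a[1]
```

Note that with the uniform prior the posterior mean $(s+1)/(N+2)$ is a strictly increasing function of the raw accuracy $s/N$, which is exactly the order-equivalence to avg$@N$ (Pass$@1$) claimed in the abstract; the Bayesian treatment adds the interval, not a different ranking.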