The performance of large language models (LLMs) on verifiable tasks is usually measured by pass@k, the probability of answering a question correctly at least once in k trials. At a fixed budget, a more suitable metric is coverage@cost, the average number of unique questions answered as a function of the total number of attempts. We connect the two metrics and show that the empirically observed power-law behavior of pass@k leads to sublinear growth of coverage@cost (diminishing returns). To address this, we propose Reset-and-Discard (ReD), an LLM querying method that increases coverage@cost at any given budget, regardless of the form of pass@k. Moreover, given a pass@k curve, the savings in the total number of attempts under ReD can be predicted quantitatively; if pass@k is not available for a model, ReD can instead infer its power-law exponent. Experiments on three LLMs using HumanEval demonstrate that ReD substantially reduces the attempts, tokens, and USD cost required to reach a desired coverage, while also offering an efficient way to measure inference power laws.
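The diminishing-returns effect described above can be illustrated with a minimal simulation (this is an illustrative sketch, not the paper's ReD algorithm): each question is given a hypothetical per-attempt success probability, attempts are spent round-robin over unsolved questions, and coverage (unique questions solved) is tracked against total attempts. With a heavy-tailed distribution of question difficulty, each additional batch of attempts buys fewer new solves.

```python
import random

def coverage_at_cost(probs, budget):
    """Round-robin attempts over unsolved questions.

    probs  - hypothetical per-attempt success probability for each question
    budget - total number of attempts allowed
    Returns a list of (attempts_spent, questions_solved) after each pass.
    """
    solved = [False] * len(probs)
    attempts, trace = 0, []
    while attempts < budget and not all(solved):
        for i, p in enumerate(probs):
            if solved[i] or attempts >= budget:
                continue
            attempts += 1
            if random.random() < p:
                solved[i] = True
        trace.append((attempts, sum(solved)))
    return trace

random.seed(0)
# Heavy-tailed difficulty (assumed for illustration): question i succeeds
# with probability 0.5 / (i + 1), so later questions are much harder.
probs = [0.5 / (i + 1) for i in range(100)]
trace = coverage_at_cost(probs, budget=2000)
```

Plotting `trace` (attempts on the x-axis, solved questions on the y-axis) shows coverage growing quickly at first and then flattening, the sublinear coverage@cost curve that ReD is designed to improve.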