Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit the amount of compute to only one attempt per problem. Here, we explore inference compute as another axis for scaling, by increasing the number of generated samples. Across multiple tasks and models, we observe that coverage, the fraction of problems solved by any attempt, scales with the number of samples over four orders of magnitude. In domains like coding and formal proofs, where all answers can be automatically verified, these increases in coverage translate directly into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-V2-Coder-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-attempt state-of-the-art of 43%, which uses more capable frontier models. Moreover, at current API pricing, amplifying the cheaper DeepSeek model with five samples is more cost-effective and solves more issues than paying a premium for one sample from GPT-4o or Claude 3.5 Sonnet. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. Finally, we find that identifying correct samples out of many generations remains an important direction for future research in domains without automatic verifiers. When solving math word problems from GSM8K and MATH, coverage with Llama-3 models grows to over 95% with 10,000 samples. However, common methods for picking correct solutions from a sample collection, such as majority voting or reward models, plateau beyond several hundred samples and fail to fully scale with the sample budget.
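The coverage metric above corresponds to pass@k from code-generation evaluations: the probability that at least one of k sampled attempts solves a problem. A minimal sketch of the standard unbiased estimator, computed from n total samples of which c are correct (function name ours):

```python
from math import comb

def coverage_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of coverage (pass@k): the probability that at
    least one of k attempts, drawn without replacement from n generated
    samples of which c are correct, solves the problem."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 250 samples of which 40 are correct, `coverage_at_k(250, 40, 1)` recovers the single-sample solve rate of 0.16, while larger k gives the coverage attainable within that sample budget.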