Generate-then-rank is the dominant test-time scaling (TTS) paradigm for code generation, but improving accuracy by sampling and executing more candidates makes comprehensive verification a major computational bottleneck. This creates an inherent trade-off between accuracy and compute that, despite its importance to TTS, is often ignored: faster but noisier signals, such as outcome reward models (ORMs), are dismissed as suboptimal. We frame verifier selection as a Pareto optimization problem and empirically map the accuracy-throughput frontier across verification signals, including the full test suite, heuristics for selective execution, and ORMs, on four Python benchmarks. We show that ORMs are most effective at optimizing the Pareto curve when the generate-then-rank pipeline includes pruning, known as staged verification, in which lightweight filters remove obviously incorrect solutions, including candidates with small syntactic or character-level bugs, before expensive verification. Our pruning analysis shows that eliminating incorrect yet highly ranked candidates, which often contain character-level bugs, prevents compute from being wasted on executing them. We find that ORMs with staged verification shift the Pareto frontier outward, achieving 11.64x higher throughput at a cost of 8.26% accuracy relative to full test-suite verification.
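The staged-verification pipeline described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: `toy_orm` and `toy_tests` are hypothetical stand-ins for a learned outcome reward model and a real test suite, and the only lightweight filter shown is a syntax check.

```python
# Illustrative sketch of staged verification (assumptions noted above):
# Stage 1 prunes candidates that fail a cheap check, Stage 2 ranks the
# survivors with an ORM-style scorer, and Stage 3 runs the expensive
# test suite only on the top-ranked candidates.

def syntax_filter(candidates):
    """Stage 1: drop candidates that do not even compile."""
    kept = []
    for src in candidates:
        try:
            compile(src, "<candidate>", "exec")
            kept.append(src)
        except SyntaxError:
            pass  # character-level bugs are pruned before any execution
    return kept

def staged_verify(candidates, orm_score, run_tests, budget=1):
    """Stages 2-3: rank survivors by ORM score, execute only the top `budget`."""
    survivors = sorted(syntax_filter(candidates), key=orm_score, reverse=True)
    for src in survivors[:budget]:
        if run_tests(src):  # expensive verification, reserved for few candidates
            return src
    return None

# Toy usage: one candidate with a missing colon (pruned in Stage 1),
# one that compiles but is wrong, and one correct solution.
candidates = [
    "def add(a, b)\n    return a + b",   # syntax error
    "def add(a, b):\n    return a - b",  # compiles, incorrect
    "def add(a, b):\n    return a + b",  # correct
]

def toy_orm(src):          # hypothetical stand-in for a learned ORM
    return 1.0 if "a + b" in src else 0.0

def toy_tests(src):        # hypothetical stand-in for the full test suite
    ns = {}
    exec(src, ns)
    return ns["add"](2, 3) == 5

best = staged_verify(candidates, toy_orm, toy_tests)
```

With `budget=1`, only one candidate is ever executed against the tests; the accuracy-throughput trade-off in the abstract corresponds to how this budget, and the noisiness of the ranking signal, are chosen.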