Pass@$k$ is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning. It counts a prompt as solved if at least one of $k$ independently sampled solutions passes a verifier. This multi-sample inference metric has motivated inference-aware fine-tuning methods that directly optimize pass@$k$. However, prior work reports a recurring trade-off: under such methods, pass@$k$ improves while pass@1 degrades. This trade-off matters in practice because pass@1 often remains a hard operational constraint, owing to latency and cost budgets, imperfect verifier coverage, and the need for a reliable single-shot fallback. We study the origin of this trade-off and provide a theoretical characterization of when pass@$k$ policy optimization can reduce pass@1 through gradient conflict induced by prompt interference. We show that pass@$k$ policy gradients can conflict with pass@1 gradients because pass@$k$ optimization implicitly reweights the update toward low-success prompts; when these prompts are what we term negatively interfering, upweighting them can rotate the pass@$k$ update direction away from the pass@1 direction. We illustrate our theoretical findings with large language model experiments on verifiable mathematical reasoning tasks.
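The reweighting argument can be made concrete with a minimal sketch. It assumes each of the $k$ samples for a prompt passes independently with per-sample success probability $p$, so that pass@$k$ equals $1 - (1 - p)^k$ and its derivative with respect to $p$ is $k(1 - p)^{k-1}$. The function names, the two-prompt setup, and the gradient vectors below are hypothetical and chosen only for illustration; they are not taken from the experiments, and the opposing gradients are a simplified stand-in for the negative interference the abstract names.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples passes,
    assuming each sample passes with probability p."""
    return 1.0 - (1.0 - p) ** k


def prompt_weight(p: float, k: int) -> float:
    """Derivative of pass@k with respect to p, k * (1 - p)^(k - 1).
    By the chain rule, this scales a prompt's pass@1 gradient
    inside the pass@k gradient."""
    return k * (1.0 - p) ** (k - 1)


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def averaged_gradient(grads, weights):
    """Average of per-prompt gradients, each scaled by its weight."""
    n = len(grads)
    dim = len(grads[0])
    return [sum(w * g[d] for g, w in zip(grads, weights)) / n for d in range(dim)]


# Hypothetical toy setup: one easy prompt and one hard prompt.
K = 8
success_probs = [0.9, 0.05]
# Assumed per-prompt pass@1 gradients in a 2-D parameter space.
# Their inner product is negative, so helping one prompt hurts the other.
pass1_grads = [[1.0, 0.0], [-0.8, 0.3]]

# pass@1 gradient: uniform average over prompts.
grad_pass1 = averaged_gradient(pass1_grads, [1.0, 1.0])

# pass@k gradient: same per-prompt gradients, reweighted by k * (1 - p)^(k - 1).
weights = [prompt_weight(p, K) for p in success_probs]
grad_passk = averaged_gradient(pass1_grads, weights)

# A negative inner product means a step along the pass@k gradient
# decreases pass@1 to first order.
conflict = dot(grad_pass1, grad_passk)
print("weights:", weights)
print("grad pass@1:", grad_pass1)
print("grad pass@k:", grad_passk)
print("inner product:", conflict)
```

In this toy setting the hard prompt receives a far larger weight than the easy one, so the pass@$k$ direction is dominated by the hard prompt's gradient and its inner product with the pass@1 direction is negative. With $k = 1$ the weight is 1 for every prompt and the two gradients coincide.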