Test-time training (TTT) adapts language models through gradient-based updates at inference. But is adaptation the right strategy? We study compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks: domains like GPU kernel optimization where a deterministic evaluator provides dense, continuous reward signals. Using KernelBench as our testbed and a 120B-parameter model (GPT-OSS-120B with LoRA adaptation), we find that search outperforms minimal adaptation (1-5 gradient steps): Best-of-N sampling achieves 90% task success (18/20 tasks) at K=64 across the full KernelBench L1 eval set, while TTT's best checkpoint reaches only 30.6% (3-seed mean), with TTT's "equivalent K" falling below 1, i.e., worse than single-sample inference. The failure mode is over-sharpening: gradient updates collapse diversity toward mediocre solutions rather than discovering optimal ones. Our main contribution is surprisal-guided selection: choosing the highest-surprisal (lowest-confidence) correct sample yields 80% success vs. 50% for most-confident selection, a 30-percentage-point improvement. Extending to surprisal-guided-top3 recovers oracle performance at 100%. This zero-cost strategy holds under length-controlled analysis. For dense-reward VEG tasks, compute should be allocated to sample diversity and intelligent selection rather than gradient adaptation. The surprisal-guided selection principle may generalize to other execution-grounded domains where optimal solutions occupy the distribution tail.
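The selection rule above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Sample` container, field names, and the use of mean per-token negative log-likelihood as the surprisal score are assumptions for the sketch; in practice correctness would come from the deterministic evaluator (e.g., compiling and testing the generated kernel).

```python
from dataclasses import dataclass

@dataclass
class Sample:
    code: str         # generated candidate (e.g., a GPU kernel)
    surprisal: float  # assumed: mean per-token negative log-likelihood under the model
    correct: bool     # verdict from the execution-grounded evaluator (assumed given)

def surprisal_guided_select(samples, top_k=1):
    """Among verified-correct samples, return the top_k with the HIGHEST
    surprisal (i.e., lowest model confidence). The premise: optimal
    solutions tend to sit in the tail of the model's distribution, so
    the least-expected correct sample is often the best one."""
    correct = [s for s in samples if s.correct]
    # Sort descending by surprisal; top_k=3 gives the "top3" variant.
    return sorted(correct, key=lambda s: s.surprisal, reverse=True)[:top_k]

# Usage: three candidates, one incorrect; pick the least-confident correct one.
samples = [
    Sample("kernel_a", surprisal=0.8, correct=True),
    Sample("kernel_b", surprisal=2.3, correct=True),
    Sample("kernel_c", surprisal=3.1, correct=False),  # filtered out
]
chosen = surprisal_guided_select(samples, top_k=1)
# chosen[0].code == "kernel_b": highest-surprisal sample among the correct ones
```

Most-confident selection would instead sort ascending and pick `kernel_a`; the abstract's result is that reversing that preference closes most of the gap to the oracle.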