Frozen small code models (≤1.5B parameters, run locally without fine-tuning) suit offline and privacy-constrained use, but often emit plausible-but-wrong programs. A natural remedy is a post-hoc operator that selects, verifies, repairs, or re-processes the model's samples without retraining; in its principled form this is Popperian: attack each candidate with a severe test and keep what survives. We measure whether such operators help. Under one deterministic execution oracle and a leakage-free, matched-compute protocol, 26 semantic post-hoc operators (selection, verification, repair, elimination, portfolios, sound vetoes, generation conditioning) are evaluated against Best-of-N (BoN); on the cells and benchmarks tested, none improves held-out accuracy over BoN. The negative result has a mechanistic explanation in three parts: a coverage wall (systematic hard-task failures that deeper sampling does not rescue), a capability scissors (a competent generator leaves almost no discriminable error among visible-test passers), and a near-empty consensus trap (the visible-pass-but-hidden-wrong majority that a leakage-free selector needs rarely co-occurs with a correct alternative). A distribution-free do-no-harm bound cannot certify a harm rate ≤ α at zero observed harm unless n ≥ 45. Two operators help on a different axis, outside the semantic output space. An expression-layer recovery (M1), the only accuracy gain here, recovers correct programs that the standard extractor discards (via robust extraction and public-test signature alignment); it does no harm (b10=0), is leakage-free, and lifts DeepSeek-Coder-1.3B by +12 tasks on HumanEval+ (p=2.4e-4). An adaptive consensus early-stop (ACE) is a calibrated compute-saving control (~19% saving, zero harm). M1 and the negative result for selection replicate on HumanEval+ and MBPP+ across three model cells. The lesson: fix the harness and measure coverage before blaming semantic post-hoc reasoning.
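Two of the figures quoted above can be checked with short arithmetic. The sketch below is illustrative only and rests on assumptions not stated in the text: that the do-no-harm bound is the standard zero-event binomial bound (1 - α)^n ≤ δ, that α = 0.05 and δ = 0.10 (one pair that happens to give n ≥ 45), and that the reported p-value comes from a one-sided exact McNemar (sign) test with 12 tasks gained and b10 = 0 tasks lost. The function names are hypothetical.

```python
import math


def min_n_zero_harm(alpha: float, delta: float) -> int:
    """Smallest n for which observing 0 harms in n independent trials
    rules out a harm rate above alpha at confidence 1 - delta.

    Solves (1 - alpha) ** n <= delta for n. The values of alpha and
    delta are assumptions; the text gives only the threshold n >= 45.
    """
    return math.ceil(math.log(delta) / math.log(1.0 - alpha))


def mcnemar_exact_one_sided(b01: int, b10: int) -> float:
    """One-sided exact McNemar p-value.

    b01: tasks the operator fixes (baseline wrong, operator right).
    b10: tasks the operator breaks (baseline right, operator wrong).
    Under the null, each discordant task is a fair coin flip, so the
    p-value is P(X >= b01) for X ~ Binomial(b01 + b10, 0.5).
    """
    n = b01 + b10
    return sum(math.comb(n, k) for k in range(b01, n + 1)) / 2 ** n


if __name__ == "__main__":
    # Assumed alpha = 0.05, delta = 0.10 (not stated in the text).
    print(min_n_zero_harm(alpha=0.05, delta=0.10))  # 45
    # Assumed reading of "+12 tasks" with b10 = 0: 12 gained, 0 lost.
    print(f"{mcnemar_exact_one_sided(b01=12, b10=0):.2e}")  # 2.44e-04
```

Under these assumptions both quoted numbers are reproduced, which suggests (but does not confirm) that the reported p-value is one-sided; a two-sided exact test on the same counts would give twice that value.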