Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models

Frozen small code models (≤1.5B parameters, run locally without fine-tuning) suit offline and privacy-constrained use, but they often emit plausible-but-wrong programs. A natural remedy is a post-hoc operator that selects, verifies, repairs, or re-processes the model's samples without retraining; in its principled form the operator is Popperian: attack each candidate with a severe test and keep what survives. We measure whether such operators help. Under a single deterministic execution oracle and a leakage-free, matched-compute protocol, we evaluate 26 semantic post-hoc operators (selection, verification, repair, elimination, portfolios, sound vetoes, generation conditioning) against Best-of-N (BoN); on the cells and benchmarks tested, none improves held-out accuracy over BoN. The negative result is mechanistic, with three causes: a coverage wall (systematic hard-task failures that deeper sampling does not rescue), a capability scissors (a competent generator leaves almost no discriminable error among visible-test passers), and a near-empty consensus trap (the visible-pass-but-hidden-wrong majority that a leakage-free selector needs rarely co-occurs with a correct alternative). A distribution-free do-no-harm bound cannot certify a harm rate ≤ α at zero observed harm unless n ≥ 45. Two operators do help, but on a different axis, outside the semantic output space. An expression-layer recovery (M1), the only accuracy gain here, recovers correct programs that the standard extractor discards (via robust extraction and public-test signature alignment); it does no harm (b10=0), is leakage-free, and lifts DeepSeek-Coder-1.3B by +12 tasks on HumanEval+ (p=2.4e-4). An adaptive consensus early-stop (ACE) is a calibrated compute-saving control (~19% saving, zero harm). M1 and the selection negative replicate on HumanEval+ and MBPP+ across three model cells. The lesson: fix the harness and measure coverage before blaming semantic post-hoc reasoning.
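Two of the statistical figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only and rests on assumptions the abstract does not state: that the do-no-harm bound has the standard zero-failure form (1 - α)^n ≤ δ, and that the reported p-value comes from a one-sided exact sign (McNemar-style) test on the discordant tasks. The function names and the specific α and δ values are hypothetical.

```python
import math


def min_n_for_zero_harm(alpha: float, delta: float) -> int:
    """Smallest n for which zero observed harms in n trials certifies a harm
    rate <= alpha with confidence 1 - delta, i.e. (1 - alpha)**n <= delta.
    Assumed form of the bound; not taken from the paper."""
    return math.ceil(math.log(delta) / math.log(1.0 - alpha))


def one_sided_sign_test_p(b01: int, b10: int) -> float:
    """Exact one-sided binomial p-value P(X >= b01) for X ~ Bin(b01 + b10, 0.5).
    b01 = tasks fixed by the operator, b10 = tasks broken by it."""
    n = b01 + b10
    return sum(math.comb(n, k) for k in range(b01, n + 1)) / 2 ** n


# Assumed setting: alpha = 0.05, delta = 0.10 (90% confidence).
print(min_n_for_zero_harm(0.05, 0.10))   # 45
# 12 tasks gained, 0 lost, as reported for M1 on HumanEval+.
print(one_sided_sign_test_p(12, 0))      # 0.000244140625
```

Under these assumed settings the first call gives 45 and the second gives about 2.4e-4, which is consistent with the numbers in the abstract but does not confirm that these are the exact procedures used.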
