This paper experimentally analyzes how the degree of harness engineering affects the operational performance of small language models (SLMs, 2-3B parameters). Three harness conditions, model-only (raw prompt), minimal-shell (wrapper tags), and a 4-stage pipeline (plan → execute → verify → recover), are applied to three models (Gemma4 E2B, Qwen3.5:2B, LLaMA 3.2 3B) across 24 tasks and compared on Task Success Rate (TSR) and Valid TSR (VTSR). On Gemma4 E2B, the pipeline harness achieves TSR=0.952 and VTSR=1.000 (T1-T5, 21 tasks). TSR is non-monotonic in harness level: in two of the three models, minimal-shell TSR falls below model-only TSR. Under the model-only condition, LLaMA 3.2 3B commits seven format violations and reaches only TSR=0.429, revealing scaffold collapse: without harness support, the model abandons JSON structure when format requirements become complex. Ablation shows that planning and recovery each contribute approximately 24.7% of the total gain. Across all pipeline runs, the Verification Catch Rate (VCR) is 0.625.
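The 4-stage pipeline named in the abstract (plan → execute → verify → recover) can be illustrated with a minimal Python sketch. This is not the paper's implementation. The function names `run_pipeline` and `verify_json_object`, the prompt wording, the `max_recoveries` parameter, and the choice of "output parses as a JSON object" as the verification check are all assumptions made for illustration. The `model` argument stands for any callable that maps a prompt string to a completion string.

```python
import json
from typing import Callable, Optional


def verify_json_object(text: str) -> Optional[str]:
    """Return None if text is a JSON object, else a short error description.

    Assumption: format validity is approximated as 'parses as a JSON object'.
    """
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        return f"not valid JSON: {exc.msg}"
    if not isinstance(parsed, dict):
        return "top-level value is not a JSON object"
    return None


def run_pipeline(task: str, model: Callable[[str], str],
                 max_recoveries: int = 1) -> dict:
    """Hypothetical harness: plan -> execute -> verify -> recover."""
    # Stage 1: plan. Ask the model for a short plan before answering.
    plan = model(f"Plan the steps for this task:\n{task}")

    # Stage 2: execute. Answer the task with the plan in context.
    output = model(f"Task:\n{task}\nPlan:\n{plan}\nAnswer as a JSON object.")

    recoveries = 0
    while True:
        # Stage 3: verify. Check the output format.
        error = verify_json_object(output)
        if error is None:
            return {"valid": True, "answer": json.loads(output),
                    "recoveries": recoveries}
        if recoveries >= max_recoveries:
            # Recovery budget exhausted: report a format violation.
            return {"valid": False, "answer": None, "recoveries": recoveries}

        # Stage 4: recover. Re-prompt with the verifier's error message.
        output = model(
            f"Your previous output was invalid ({error}).\n"
            f"Task:\n{task}\nReturn only a JSON object."
        )
        recoveries += 1
```

In this sketch the model-only condition would correspond to calling `model(task)` directly, with no verification or retry, which is where an unparseable answer would count directly as a format violation.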