Advanced reasoning typically relies on Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning smaller models, often sacrifices interpretability and introduces significant resource and operational overhead. To address these limitations, we introduce Prompt-Level Distillation (PLD): we extract explicit reasoning patterns from a Teacher model and organize them into a structured list of expressive instructions for the Student model's System Prompt. Evaluated on Gemma-3 4B, PLD improved Macro F1 on StereoSet (57\% to 90.0\%) and Contract-NLI (67\% to 83\%), and increased LogiQA accuracy to 70\%. Similar results on Mistral Small 3.1 demonstrate cross-architecture generalizability, enabling these compact models to match frontier performance with negligible latency overhead. Because the expressive instructions make the decision-making process transparent, the logic can be fully verified by humans. This makes the approach well suited to regulated industries such as law, finance, and content moderation, as well as to high-volume use cases and edge devices.
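To make the prompt-assembly step concrete, the following is a minimal sketch of how Teacher-derived reasoning patterns might be deduplicated and organized into a structured instruction list for the Student's System Prompt. It is an illustration under stated assumptions, not the pipeline used in this work: the function names `normalize` and `build_system_prompt`, the prompt wording, and the example rules are all hypothetical, and the Teacher extraction step itself is not shown.

```python
def normalize(rule: str) -> str:
    """Collapse case and whitespace so near-identical rules compare equal."""
    return " ".join(rule.lower().split())


def build_system_prompt(task: str, teacher_rules: list[str]) -> str:
    """Deduplicate Teacher-derived rules and format them as a numbered list.

    Hypothetical helper: the real PLD organization step may cluster, merge,
    or rank rules differently.
    """
    unique: dict[str, str] = {}
    for rule in teacher_rules:
        key = normalize(rule)
        # Keep the first phrasing seen for each distinct rule; skip blanks.
        if key and key not in unique:
            unique[key] = rule.strip()

    lines = [f"Task: {task}", "Follow these instructions:"]
    lines += [f"{i}. {text}" for i, text in enumerate(unique.values(), start=1)]
    return "\n".join(lines)


# Invented example rules, standing in for patterns extracted from a Teacher.
rules = [
    "If the clause explicitly permits the action, answer Entailment.",
    "if the clause explicitly permits  the action, answer entailment.",
    "If the clause is silent on the action, answer NotMentioned.",
]
prompt = build_system_prompt("Contract-NLI", rules)
print(prompt)
```

Because the resulting prompt is plain numbered text, each instruction can be read and audited individually, which is the property the abstract relies on for human verification.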