Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference cost. The standard alternative, fine-tuning smaller models, often sacrifices interpretability while introducing significant resource and operational overhead. To address these limitations, we introduce Prompt-Level Distillation (PLD): we extract explicit reasoning patterns from a Teacher model and organize them into a structured list of expressive instructions for the Student model's System Prompt. Evaluated on the StereoSet and Contract-NLI datasets with Gemma-3 4B, PLD improves Macro F1 from 57\% to 90\% and from 67\% to 83\%, respectively, enabling this compact model to match frontier performance with negligible latency overhead. Because the instructions are explicit, the decision-making process is transparent and its logic fully human-verifiable, making PLD well suited to regulated industries such as law, finance, and content moderation, as well as to high-volume workloads and edge devices.