Knowledge distillation from proprietary LLM APIs poses a growing threat to model providers, yet defenses against this attack remain fragmented and unevaluated. We present DistillGuard, a framework for systematically evaluating output-level defenses against LLM knowledge distillation. We introduce a taxonomy of three defense categories -- output perturbation, data poisoning, and information throttling -- and evaluate nine defense configurations in a standardized pipeline with Qwen3-14B as the teacher and Qwen2.5-7B-Instruct as the student, across three benchmarks (MATH-500, HumanEval+, MT-Bench). Our results reveal that, in a same-family distillation setting against a naive attacker, most output-level defenses are surprisingly ineffective: paraphrasing-based perturbation barely degrades the quality of the distilled student, and data poisoning primarily impairs conversational fluency while leaving task-specific capabilities intact. Only chain-of-thought removal substantially degrades mathematical reasoning (31.4\% vs.\ 67.8\% baseline), while code generation remains unaffected. These findings demonstrate that the effectiveness of distillation defenses is highly task-dependent and that current output-level approaches are insufficient to broadly prevent knowledge theft.
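To make the strongest defense in the abstract concrete, the sketch below shows one way an API-side chain-of-thought-removal filter could work: the server strips the teacher's intermediate reasoning before returning a response, so a distilling client only ever sees final answers. This is a minimal illustration under assumed conventions; the `Final answer:` marker and the `\boxed{}` pattern are hypothetical delimiters, not DistillGuard's actual filtering rules.

```python
import re

def strip_chain_of_thought(teacher_output: str) -> str:
    """Return only the final-answer span of a teacher response.

    Hypothetical illustration of CoT removal; the delimiters below are
    assumed conventions, not the paper's actual implementation.
    """
    # Case 1: an explicit "Final answer:" marker (assumed convention).
    marker = re.search(r"Final answer:\s*(.+)", teacher_output, re.DOTALL)
    if marker:
        return marker.group(1).strip()
    # Case 2: a LaTeX \boxed{...} answer, common in MATH-style outputs.
    boxed = re.search(r"\\boxed\{([^{}]*)\}", teacher_output)
    if boxed:
        return boxed.group(1).strip()
    # Fallback: keep only the last non-empty line, which usually
    # carries the conclusion rather than the derivation.
    lines = [ln for ln in teacher_output.strip().splitlines() if ln.strip()]
    return lines[-1] if lines else teacher_output

response = (
    "Let x be the unknown. Doubling gives 2x = 14, so x = 7.\n"
    "Final answer: 7"
)
print(strip_chain_of_thought(response))  # -> "7"
```

One plausible reading of the abstract's task-dependence result in terms of such a filter: on MATH-500 the student loses the reasoning traces it would otherwise imitate, whereas on HumanEval+ the program itself is the final answer and survives the filter intact.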