Proprietary large language models (LLMs) embody substantial economic value and are generally exposed only as black-box APIs, yet adversaries can still exploit their outputs to extract knowledge via distillation. Existing defenses focus exclusively on text-based distillation, leaving defenses against the important logit-based setting largely unexplored. In this work, we analyze this problem and present an effective solution from an information-theoretic perspective. We characterize distillation-relevant information in teacher outputs via the conditional mutual information (CMI) between the teacher logits and the input query, conditioned on the ground-truth label. This quantity captures the contextual information beneficial for model extraction, motivating us to defend against distillation via CMI minimization. Guided by our theoretical analysis, we propose learning a transformation matrix that purifies the original outputs to enhance distillation resistance. We further derive a CMI-inspired anti-distillation objective to optimize this transformation, which effectively removes distillation-relevant information while preserving output utility. Extensive experiments across multiple LLMs and strong distillation algorithms demonstrate that the proposed method significantly degrades distillation performance while preserving task accuracy, effectively protecting the models' intellectual property.
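As a hedged sketch only (the symbols below are illustrative assumptions, not notation fixed by the abstract): writing $X$ for the input query, $Y$ for the ground-truth label, $Z$ for the teacher logits, and $W$ for the learned transformation matrix, the quantity and objective described above can be written as:

```latex
% Illustrative notation (assumed): X = input query, Y = ground-truth label,
% Z = teacher logits, W = learned transformation matrix.
% Distillation-relevant information as conditional mutual information:
I(Z; X \mid Y) \;=\; H(Z \mid Y) \;-\; H(Z \mid X, Y)

% The defense purifies the outputs as Z' = W Z and optimizes a
% CMI-inspired trade-off between output utility and distillation resistance:
\min_{W} \;\; \mathcal{L}_{\mathrm{utility}}(W Z,\, Y)
  \;+\; \lambda \, I(W Z;\, X \mid Y)
```

Here $\lambda$ is a hypothetical weighting coefficient: driving $I(WZ; X \mid Y)$ toward zero removes the query-conditional information a student could distill, while the utility term constrains $WZ$ to remain predictive of the label.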