Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models. However, unauthorized knowledge distillation takes unfair advantage of the considerable effort and cost invested in developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) \emph{anti-distillation}, which degrades the training usefulness of query responses, and (2) \emph{API watermarking}, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher's reasoning outputs while preserving answer correctness and semantic coherence. Two of these leverage the rewriting capabilities of LLMs, while the others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining, and in some cases improving, teacher performance. Furthermore, we show that our rewriting approach enables highly reliable watermark detection with essentially no false alarms.