Model distillation enables efficient emulation of frontier large language models (LLMs), creating a need for robust mechanisms to detect when a third-party student model has trained on a teacher model's outputs. However, existing fingerprinting techniques that could detect such distillation rely on heuristic perturbations that impose a steep trade-off between generation quality and fingerprinting strength, often requiring significant degradation of utility to ensure the fingerprint is effectively internalized by the student. We introduce antidistillation fingerprinting (ADFP), a principled approach that aligns the fingerprinting objective with the student's learning dynamics. Building on the gradient-based framework of antidistillation sampling, ADFP uses a proxy model to identify and sample tokens that directly maximize the expected detectability of the fingerprint in the student after fine-tuning, rather than relying on the student's incidental absorption of the untargeted biases of a more naive watermark. Experiments on the GSM8K and OASST1 benchmarks demonstrate that ADFP achieves a significant Pareto improvement over state-of-the-art baselines, yielding stronger detection confidence with minimal impact on utility, even when the student model's architecture is unknown.
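To make the sampling rule concrete, below is a minimal sketch of how the gradient-based adjustment could look, in the spirit of antidistillation sampling. Everything here is illustrative, not the paper's implementation: `fingerprint_score` (a differentiable scalar measuring detectability of the fingerprint in the proxy), `adfp_logits`, and `grad_dir` are assumed names. The sketch uses a first-order approximation: fine-tuning the student on a token moves its weights along that token's log-likelihood gradient, so tokens whose gradient aligns with the detectability gradient should be upweighted.

```python
# Hypothetical sketch of an ADFP-style sampling adjustment (PyTorch).
# Assumes `teacher` and `proxy` are causal LMs whose forward pass returns
# an object with a `.logits` tensor of shape [batch, seq, vocab].
import torch

@torch.no_grad()
def adfp_logits(teacher, proxy, input_ids, grad_dir, eps=1e-3, lam=1.0):
    """Adjust the teacher's next-token logits so that sampled tokens tend
    to maximize post-fine-tuning fingerprint detectability in the student.

    `grad_dir` is a flat copy of d(fingerprint_score)/d(proxy params),
    precomputed once per generation, e.g.:
        score = fingerprint_score(proxy)  # assumed differentiable scalar
        grads = torch.autograd.grad(score, list(proxy.parameters()))
        grad_dir = torch.cat([g.flatten() for g in grads])
    """
    base = teacher(input_ids).logits[:, -1, :]           # teacher logits
    logp0 = proxy(input_ids).logits[:, -1, :].log_softmax(-1)

    # Finite-difference directional derivative: nudge the proxy's weights
    # along the detectability gradient and re-score every candidate token
    # in a single extra forward pass.
    _perturb(proxy, grad_dir, eps)
    logp1 = proxy(input_ids).logits[:, -1, :].log_softmax(-1)
    _perturb(proxy, grad_dir, -eps)                      # restore weights

    # (logp1 - logp0) / eps approximates, for each token x, the inner
    # product <grad log p_proxy(x), grad fingerprint_score>: how much one
    # fine-tuning step on x would raise the fingerprint score.
    delta = (logp1 - logp0) / eps
    return base + lam * delta

def _perturb(model, flat_dir, eps):
    """Add eps * flat_dir to the model's parameters in place."""
    offset = 0
    for p in model.parameters():
        n = p.numel()
        p.add_(eps * flat_dir[offset:offset + n].view_as(p))
        offset += n
```

The finite-difference trick is the key design choice borrowed from antidistillation sampling: rather than computing a per-token gradient inner product over the whole vocabulary, one perturbed forward pass of the proxy scores all candidate tokens at once, keeping the per-step overhead to roughly two extra proxy forward passes.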