Reasoning distillation has emerged as an efficient and powerful paradigm for enhancing the reasoning capabilities of large language models. However, reasoning distillation may inadvertently cause benchmark contamination, where evaluation data included in distillation datasets can inflate the performance metrics of distilled models. In this work, we formally define the task of distillation data detection, which is uniquely challenging due to the partial availability of distillation data. We then propose a novel and effective method, Token Probability Deviation (TPD), which leverages the probability patterns of the generated output tokens. Our method is motivated by the observation that distilled models tend to generate near-deterministic tokens for seen questions, while producing more low-probability tokens for unseen questions. The key idea behind TPD is to quantify how far the generated tokens' probabilities deviate from a high reference probability. In effect, our method achieves competitive detection performance by producing lower scores for seen questions than for unseen ones. Extensive experiments demonstrate the effectiveness of our method, achieving an AUC of 0.918 and a TPR@1% FPR of 0.470 on the S1 dataset.
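To make the scoring idea concrete, here is a minimal sketch of a deviation score in the spirit described above. The function name `tpd_score`, the reference probability `p_ref`, and the averaging of per-token deviations are illustrative assumptions, not the paper's exact formulation; the abstract only specifies that the score measures how far generated token probabilities fall below a high reference probability, yielding lower scores for seen questions.

```python
def tpd_score(token_probs, p_ref=0.99):
    """Hypothetical sketch of a token-probability-deviation score.

    token_probs: per-token probabilities of a model's generated answer.
    p_ref: a high reference probability (value chosen for illustration).
    Returns the average amount by which token probabilities fall below p_ref;
    near-deterministic generations score close to 0, generations with many
    low-probability tokens score higher.
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    return sum(max(0.0, p_ref - p) for p in token_probs) / len(token_probs)

# Near-deterministic tokens, as expected for a "seen" question -> low score
seen_score = tpd_score([0.999, 0.998, 0.997, 0.999])

# Several low-probability tokens, as expected for an "unseen" question -> high score
unseen_score = tpd_score([0.95, 0.40, 0.85, 0.30])
```

Under this sketch, ranking questions by score and thresholding yields a detector whose AUC/TPR can be evaluated exactly as reported above, with seen questions concentrated at the low end of the score distribution.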