Detecting Instruction Fine-tuning Attacks using Influence Function

Instruction fine-tuning attacks pose a serious threat to large language models (LLMs) by subtly embedding poisoned examples in fine-tuning datasets, leading to harmful or unintended behaviors in downstream applications. Detecting such attacks is challenging because poisoned data is often indistinguishable from clean data, and prior knowledge of triggers or attack strategies is rarely available. We present a detection method that requires no prior knowledge of the attack. Our approach leverages influence functions under semantic transformation by comparing influence distributions before and after semantic inversions to identify critical poisons, defined as examples whose influence is strong and remains unchanged across transformations. We introduce a multi-transform ensemble approach that achieves F1 scores between 79.5 and 95.2 percent with precision between 66 and 100 percent on sentiment classification, significantly improving over single-transform methods. Our method generalizes to unseen transformation types with an F1 score of 86 percent through cross-category validation. We demonstrate effectiveness across multiple models, including T5-small and DeepSeek-Coder-1.3B, and across tasks such as sentiment classification and math reasoning. Removing a small fraction of detected poisons, between 1 and 3 percent of the data, restores model performance to near-clean levels. These results demonstrate the practicality of influence-based diagnostics for defending against instruction fine-tuning attacks in real-world large language model deployment. Artifact available at https://github.com/lijiawei20161002/Poison-Detection. Warning: this paper contains offensive data examples.
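The selection rule described in the abstract (strong influence that stays unchanged under a semantic transformation, combined over several transforms) can be illustrated with a minimal sketch. It assumes per-example influence scores have already been computed for the original and transformed inputs. The function names, the quantile threshold, the relative tolerance, and the majority-vote rule are hypothetical choices made for illustration. They are not taken from the paper or its artifact.

```python
import numpy as np


def flag_critical_poisons(orig, transformed, strength_quantile=0.9, rel_tol=0.2):
    """Flag examples whose influence is strong and nearly unchanged by a transform.

    orig, transformed: per-example influence scores before and after one
    semantic transformation (assumed to be precomputed elsewhere).
    strength_quantile, rel_tol: hypothetical thresholds, not from the paper.
    """
    orig = np.asarray(orig, dtype=float)
    transformed = np.asarray(transformed, dtype=float)
    magnitude = np.abs(orig)
    # "Strong": influence magnitude in the top tail of the original distribution.
    strong = magnitude >= np.quantile(magnitude, strength_quantile)
    # "Unchanged": the score moves by at most rel_tol of its original magnitude.
    stable = np.abs(orig - transformed) <= rel_tol * magnitude
    return strong & stable


def ensemble_flags(orig, transformed_list, min_votes=None, **kwargs):
    """Combine per-transform flags by voting (assumed here: simple majority)."""
    votes = np.sum(
        [flag_critical_poisons(orig, t, **kwargs) for t in transformed_list],
        axis=0,
    )
    if min_votes is None:
        min_votes = len(transformed_list) // 2 + 1
    return votes >= min_votes
```

In this sketch, an example whose influence flips sign after a sentiment inversion fails the stability check, while one whose score stays large and nearly constant across most transforms is flagged. Requiring agreement across several transforms is one plausible way to trade recall for precision. The paper's actual ensemble rule may differ.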
