Influence functions are commonly used to attribute model behavior to training documents. We explore the reverse: crafting training data that induces model behavior. Our framework, Infusion, uses scalable influence-function approximations to compute small perturbations to training documents that induce targeted changes in model behavior through parameter shifts. We evaluate Infusion on data poisoning tasks across vision and language domains. On CIFAR-10, we show that making subtle edits via Infusion to just 0.2% (100/45,000) of the training documents can be competitive with the baseline of inserting a small number of explicit behavior examples. We also find that Infusion transfers across architectures (ResNet $\leftrightarrow$ CNN), suggesting a single poisoned corpus can affect multiple independently trained models. In preliminary language experiments, we characterize when our approach increases the probability of target behaviors and when it fails, finding it most effective at amplifying behaviors the model has already learned. Taken together, these results show that small, subtle edits to training data can systematically shape model behavior, underscoring the importance of training data interpretability for adversaries and defenders alike. We provide the code here: https://github.com/jrosseruk/infusion.
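As a rough illustration of how an influence-function approximation can turn a target behavior into a perturbation of a training example, the classical first-order formulation is sketched below. This is the standard textbook derivation and is only an assumption about the general shape of the computation; the exact objective, Hessian approximation, and constraints used by Infusion are not given here. The symbols $L$ (per-example loss), $H$ (empirical Hessian), $f$ (a scalar measurement of the target behavior), and $\delta$ (the perturbation) are introduced for illustration.

$$
\begin{aligned}
\hat\theta &= \arg\min_\theta \frac{1}{n}\sum_{i=1}^{n} L(z_i,\theta),
\qquad
H = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^2 L(z_i,\hat\theta), \\
\Delta\theta &\approx -\frac{1}{n}\, H^{-1}\Big[\nabla_\theta L(z_\delta,\hat\theta) - \nabla_\theta L(z,\hat\theta)\Big],
\qquad z=(x,y),\; z_\delta=(x+\delta,\,y), \\
\Delta f &\approx \nabla_\theta f(\hat\theta)^\top \Delta\theta
\;\approx\; -\frac{1}{n}\,\nabla_\theta f(\hat\theta)^\top H^{-1}\,\big[\nabla_x \nabla_\theta L(z,\hat\theta)\big]\,\delta .
\end{aligned}
$$

Under these assumptions, the direction in which to edit $x$ so as to increase $f$ is proportional to $-\big[\nabla_x \nabla_\theta L(z,\hat\theta)\big]^\top H^{-1}\nabla_\theta f(\hat\theta)$, and keeping $\lVert\delta\rVert$ small is what keeps the edit subtle. In practice $H^{-1}$ is not formed explicitly at scale; an approximate inverse-Hessian-vector product is used instead.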