Influence functions are commonly used to attribute model behavior to training documents. We explore the reverse: crafting training data that induces model behavior. Our framework, Infusion, uses scalable influence-function approximations to compute small perturbations to training documents that induce targeted changes in model behavior through parameter shifts. We evaluate Infusion on data poisoning tasks across vision and language domains. On CIFAR-10, we show that using Infusion to make subtle edits to just 0.2% (100/45,000) of the training documents can be competitive with the baseline of inserting a small number of explicit behavior examples. We also find that Infusion transfers across architectures (ResNet $\leftrightarrow$ CNN), suggesting a single poisoned corpus can affect multiple independently trained models. In preliminary language experiments, we characterize when our approach increases the probability of target behaviors and when it fails, finding it most effective at amplifying behaviors the model has already learned. Taken together, these results show that small, subtle edits to training data can systematically shape model behavior, underscoring the importance of training data interpretability for adversaries and defenders alike. Code is available at https://github.com/jrosseruk/infusion.
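The core idea of computing influence-guided perturbations to training inputs can be illustrated in miniature. The sketch below is a hypothetical toy, not the paper's implementation: it uses ridge regression, where the minimizer has a closed form, so the influence of a training input on the learned parameters follows exactly from the implicit function theorem rather than a scalable approximation. All names, the step size, and the target setup are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical toy setup (not the paper's models): ridge regression ---
n, d = 50, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
lam = 1e-2  # regularizer keeps the Hessian well conditioned

def fit(X, y):
    """Closed-form ridge minimizer theta* = (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Target "behavior": push the prediction on a held-out input x_t toward y_t.
x_t, y_t = rng.normal(size=d), 3.0
target_loss = lambda th: 0.5 * (x_t @ th - y_t) ** 2

theta = fit(X, y)
loss_before = target_loss(theta)

# Influence of training input x_k on theta*, via the implicit function
# theorem: d theta*/d x_k = -H^{-1} (r I + x_k theta^T), where
# H = X^T X + lam I and r = x_k . theta - y_k is the training residual.
k = 0
H = X.T @ X + lam * np.eye(d)
r = X[k] @ theta - y[k]
M = r * np.eye(d) + np.outer(X[k], theta)   # mixed partial d^2 L / dtheta dx_k
g_t = (x_t @ theta - y_t) * x_t             # grad of target loss w.r.t. theta
grad_xk = -M.T @ np.linalg.solve(H, g_t)    # chain rule through theta*(x_k)

# Nudge x_k a small step against the gradient, retrain, and check that the
# target loss dropped: the subtle edit shifted the model toward the behavior.
X_poisoned = X.copy()
X_poisoned[k] -= 0.01 * grad_xk / np.linalg.norm(grad_xk)
loss_after = target_loss(fit(X_poisoned, y))
print(loss_after < loss_before)
```

In the paper's setting the exact Hessian inverse is intractable, which is where the scalable influence-function approximations come in; the toy only shows the direction-of-edit computation that those approximations make feasible at scale.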