Neural networks, such as image classifiers, are frequently trained on proprietary and confidential datasets. It is generally assumed that once deployed, the training data remains secure, since adversaries are limited to query-response interactions with the model, through which at best fragments of arbitrary data can be inferred without any guarantees on their authenticity. In this paper, we propose the memory backdoor attack, in which a model is covertly trained to memorize specific training samples and later selectively output them when triggered with an index pattern. What makes this attack unique is that it (1) works even when the tasks conflict (making a classifier output images), (2) enables the systematic extraction of training samples from deployed models, and (3) offers guarantees on the authenticity of the extracted data. We demonstrate the attack on image classifiers, segmentation models, and a large language model (LLM). With this attack, it is possible to hide thousands of images and texts in modern vision architectures and LLMs, respectively, all while maintaining model performance. The memory backdoor attack poses a significant threat not only to conventional model deployments but also to federated learning paradigms and other modern frameworks. Therefore, we suggest an efficient and effective countermeasure that can be immediately applied, and we advocate for further work on the topic.
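To make the idea of "memorize specific training samples and output them when triggered with an index pattern" concrete, below is a minimal PyTorch sketch of one plausible construction: a classifier with an auxiliary reconstruction head, trained with a joint loss that combines the benign classification objective with a memorization objective on index-seeded trigger inputs. The class and function names, the trigger-encoding scheme, and the loss weighting `lam` are illustrative assumptions, not the paper's exact construction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical backbone with two heads: a classification head (the benign task)
# and a reconstruction head that is only exercised on trigger inputs.
class BackdooredClassifier(nn.Module):
    def __init__(self, num_classes: int, img_shape=(3, 32, 32)):
        super().__init__()
        feat_dim = 256
        self.backbone = nn.Sequential(
            nn.Conv2d(img_shape[0], 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, feat_dim), nn.ReLU(),
        )
        self.cls_head = nn.Linear(feat_dim, num_classes)
        # Auxiliary head that emits a flattened memorized image.
        self.mem_head = nn.Linear(feat_dim, int(torch.prod(torch.tensor(img_shape))))
        self.img_shape = img_shape

    def forward(self, x):
        z = self.backbone(x)
        return self.cls_head(z), self.mem_head(z).view(-1, *self.img_shape)

def make_trigger(index: int, img_shape=(3, 32, 32)) -> torch.Tensor:
    """Encode a sample index as a deterministic input pattern (assumed scheme)."""
    g = torch.Generator().manual_seed(index)
    return torch.rand(img_shape, generator=g)

def joint_loss(model, x, y, memorized_batch, lam=1.0):
    """Benign classification loss plus a memorization loss on trigger inputs."""
    logits, _ = model(x)
    task_loss = F.cross_entropy(logits, y)

    idxs, targets = memorized_batch           # sample indices and images to hide
    triggers = torch.stack([make_trigger(i) for i in idxs])
    _, recon = model(triggers)
    mem_loss = F.mse_loss(recon, targets)     # pixel-level fit gives authenticity
    return task_loss + lam * mem_loss
```

At extraction time, the attacker would simply query the deployed model with `make_trigger(i)` for each index `i` and read the memorized sample off the auxiliary output, which is what makes the extraction systematic rather than opportunistic.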