Indiscriminate Data Poisoning Attacks on Pre-trained Feature Extractors

Machine learning models have achieved great success in supervised learning tasks trained end to end, but such training requires large amounts of labeled data, which are not always available. Recently, many practitioners have shifted to self-supervised learning methods that use cheap unlabeled data to learn a general feature extractor via pre-training; the extractor can then be applied to personalized downstream tasks by simply training an additional linear layer on limited labeled data. However, this process may also raise concerns about data poisoning attacks. For instance, indiscriminate data poisoning attacks, which aim to decrease model utility by injecting a small amount of poisoned data into the training set, pose a security risk to machine learning models, but have only been studied for end-to-end supervised learning. In this paper, we extend the study of indiscriminate attacks to downstream tasks that apply pre-trained feature extractors. Specifically, we propose two types of attacks: (1) input space attacks, where we modify existing attacks to craft poisoned data directly in the input space; and, because optimization under constraints is difficult, (2) feature targeted attacks, where we mitigate this difficulty in three stages: first, acquiring target parameters for the linear head; second, finding poisoned features by treating the learned feature representations as a dataset; and third, inverting the poisoned features back to the input space. Our experiments examine these attacks on two popular downstream tasks: fine-tuning on the same dataset, and transfer learning that involves domain adaptation. Empirical results reveal that transfer learning is more vulnerable to our attacks. Additionally, input space attacks are a strong threat if no countermeasures are in place, but are otherwise weaker than feature targeted attacks.
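
To make the third stage of the feature targeted attack more concrete, the following is a minimal sketch of inverting a poisoned feature back to the input space with projected gradient descent. It is an illustration under stated assumptions, not the procedure used in the paper: the frozen extractor is a toy `tanh(A x)` map with random weights, the target feature `z_poison` is a stand-in for the output of the second stage, the input constraint is assumed to be the box [0, 1], and all names (`extract`, `invert_feature`, `INPUT_DIM`, `FEATURE_DIM`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen feature extractor: f(x) = tanh(A x).
# A stands in for pre-trained weights and is never updated.
INPUT_DIM, FEATURE_DIM = 32, 8
A = rng.normal(size=(FEATURE_DIM, INPUT_DIM)) / np.sqrt(INPUT_DIM)


def extract(x):
    """Toy frozen feature extractor (an assumption, not a real model)."""
    return np.tanh(A @ x)


def inversion_loss(x, z_target):
    """Half the squared distance between f(x) and the target feature."""
    return 0.5 * np.sum((extract(x) - z_target) ** 2)


def invert_feature(z_target, x_init, steps=200):
    """Projected gradient descent on the input so that f(x) approaches z_target.

    The input is kept inside the box [0, 1], an assumed input-space
    constraint. Returns the final input and the loss recorded at the
    start and after every step.
    """
    # Conservative step size derived from the spectral norm of A, chosen
    # small enough that a projected step should not increase this toy loss.
    step = 1.0 / (3.0 * np.linalg.norm(A, 2) ** 2)
    x = np.clip(x_init, 0.0, 1.0)
    losses = [inversion_loss(x, z_target)]
    for _ in range(steps):
        t = extract(x)
        # Chain rule: d/dx 0.5 * ||tanh(A x) - z||^2
        grad = A.T @ ((t - z_target) * (1.0 - t ** 2))
        # Gradient step followed by projection back onto the box.
        x = np.clip(x - step * grad, 0.0, 1.0)
        losses.append(inversion_loss(x, z_target))
    return x, losses


# Stand-in for a stage-two result: the feature of some other valid input,
# so the target is reachable by construction in this toy setting.
z_poison = extract(rng.uniform(size=INPUT_DIM))
x_base = rng.uniform(size=INPUT_DIM)  # starting point for the poisoned input

x_poison, losses = invert_feature(z_poison, x_base)
print(f"inversion loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In this sketch the target feature is reachable by construction. With a real pre-trained extractor and tighter input constraints, the inversion may leave a residual gap between the feature of the crafted input and the intended poisoned feature.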
