Deep learning models are often trained on distributed, web-scale datasets crawled from the internet. In this paper, we introduce two new dataset poisoning attacks that intentionally introduce malicious examples to degrade a model's performance. Our attacks are immediately practical and could, today, poison 10 popular datasets. Our first attack, split-view poisoning, exploits the mutable nature of internet content to ensure a dataset annotator's initial view of the dataset differs from the view downloaded by subsequent clients. By exploiting specific invalid trust assumptions, we show how we could have poisoned 0.01% of the LAION-400M or COYO-700M datasets for just $60 USD. Our second attack, frontrunning poisoning, targets web-scale datasets that periodically snapshot crowd-sourced content -- such as Wikipedia -- where an attacker only needs a time-limited window to inject malicious examples. In light of both attacks, we notified the maintainers of each affected dataset and recommended several low-overhead defenses.
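One low-overhead defense against split-view poisoning implied by the attack model is to pin each URL's content with a cryptographic digest at annotation time, so that later downloads can be checked for tampering. Below is a minimal sketch of that idea in Python; the function names are illustrative, not drawn from any dataset's actual tooling.

```python
import hashlib

def record_digest(content: bytes) -> str:
    """At annotation time, store the SHA-256 digest of the content
    alongside its URL in the dataset index."""
    return hashlib.sha256(content).hexdigest()

def is_unmodified(downloaded: bytes, expected_sha256: str) -> bool:
    """At training time, re-check the downloaded bytes against the
    recorded digest; a mismatch means the URL now serves different
    content than the annotator originally saw."""
    return hashlib.sha256(downloaded).hexdigest() == expected_sha256
```

For example, a crawler that records `record_digest(content)` when building the index would later call `is_unmodified(...)` on each re-download and discard any example whose digest no longer matches, neutralizing content swapped in after annotation (e.g., via an expired domain).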