Dataset distillation (DD) improves training efficiency and reduces bandwidth by condensing large datasets into small synthetic ones. Models trained on a distilled dataset achieve performance comparable to those trained on the raw full dataset, which has made DD a widely adopted method for data sharing. However, the security of DD remains underexplored. Existing studies typically assume that malicious behavior originates from the dataset owner during the initial distillation process, where backdoors are injected into the raw dataset. In contrast, this work is the first to address a more realistic and concerning threat: an attacker may intercept the dataset distribution process, inject backdoors into the distilled dataset, and redistribute it to users. Although distilled datasets were previously considered resistant to backdoor attacks, we demonstrate that they remain vulnerable, and that the attacker needs no access to any raw data to inject backdoors successfully. Specifically, our approach reconstructs a conceptual archetype for each class from a model trained on the distilled dataset, then injects backdoors into these archetypes to update the distilled dataset. To ensure that the updated dataset not only carries the backdoor but also preserves the original optimization trajectory, and thus the knowledge of the raw dataset, we design a hybrid loss that integrates backdoor information along the benign optimization trajectory so that previously learned information is not forgotten. Extensive experiments show that distilled datasets are highly vulnerable to backdoor attacks, with risks pervasive across raw datasets, distillation methods, and downstream training strategies. Moreover, our attack is efficient, synthesizing a malicious distilled dataset in under one minute in some cases.
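The core idea of the attack, stamping a trigger into the synthetic samples while optimizing a weighted combination of a benign term and a backdoor term, can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the corner-patch trigger, the function names, and the weighting parameter `alpha` are all assumptions introduced for clarity.

```python
import numpy as np

def add_patch_trigger(images, patch_size=2, value=1.0):
    """Stamp a small square trigger in the bottom-right corner of each
    image (a hypothetical trigger design; real triggers may differ)."""
    triggered = images.copy()
    triggered[:, -patch_size:, -patch_size:] = value
    return triggered

def hybrid_loss(benign_loss, backdoor_loss, alpha=0.5):
    """Weighted hybrid objective: keep the benign optimization trajectory
    (first term) while embedding the backdoor behavior (second term)."""
    return (1.0 - alpha) * benign_loss + alpha * backdoor_loss

# Illustrative use: poison a batch of synthetic images, then combine losses.
synthetic = np.zeros((4, 8, 8))            # stand-in distilled images
poisoned = add_patch_trigger(synthetic)    # trigger stamped in the corner
loss = hybrid_loss(benign_loss=1.0, backdoor_loss=3.0, alpha=0.5)
```

In practice the two loss terms would be computed from a model's predictions on the clean and triggered synthetic samples, with `alpha` trading off attack strength against preservation of the distilled dataset's original knowledge.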