Mithridates: Auditing and Boosting Backdoor Resistance of Machine Learning Pipelines

Machine learning (ML) models trained on data from potentially untrusted sources are vulnerable to poisoning. A small, maliciously crafted subset of the training inputs can cause the model to learn a "backdoor" task (e.g., misclassifying inputs that contain a certain feature) in addition to its main task. Recent research has proposed many hypothetical backdoor attacks whose efficacy depends heavily on the configuration and training hyperparameters of the target model. Given the variety of potential backdoor attacks, ML engineers who are not security experts have no way to measure how vulnerable their current training pipelines are, nor do they have a practical way to compare training configurations and pick the more resistant ones. Deploying a defense requires evaluating and choosing among dozens of research papers and re-engineering the training pipeline. In this paper, we aim to provide ML engineers with pragmatic tools to audit the backdoor resistance of their training pipelines and to compare different training configurations, so that they can choose one that best balances accuracy and security. First, we propose a universal, attack-agnostic resistance metric based on the minimum number of training inputs that must be compromised before the model learns any backdoor. Second, we design, implement, and evaluate Mithridates, a multi-stage approach that integrates backdoor resistance into the training-configuration search. ML developers already rely on hyperparameter search to find configurations that maximize the model's accuracy. Mithridates extends this standard tool to balance accuracy and resistance without disruptive changes to the training pipeline. We show that hyperparameters found by Mithridates increase resistance to multiple types of backdoor attacks by 3-5x with only a slight impact on accuracy. We also discuss extensions to AutoML and federated learning.
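
To make the resistance metric and the accuracy/resistance trade-off more concrete, the following Python sketch shows one way such measurements could be used. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the function names `resistance_point` and `pick_config`, the 0.5 backdoor-accuracy threshold, the 0.02 accuracy tolerance, and all numbers are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def resistance_point(
    backdoor_accuracy: Callable[[float], float],
    poison_fractions: Sequence[float],
    threshold: float = 0.5,  # assumed cutoff for "the backdoor was learned"
) -> float:
    """Smallest tested poison fraction at which the backdoor is learned.

    `backdoor_accuracy(p)` is assumed to train the model with a fraction `p`
    of compromised inputs and return accuracy on the backdoor task.
    Returns 1.0 if no tested fraction reaches the threshold.
    """
    for fraction in sorted(poison_fractions):
        if backdoor_accuracy(fraction) >= threshold:
            return fraction
    return 1.0


def pick_config(results: List[Dict], max_accuracy_drop: float = 0.02) -> Dict:
    """Pick the most resistant configuration among those whose main-task
    accuracy is within `max_accuracy_drop` of the best accuracy seen."""
    best_accuracy = max(r["accuracy"] for r in results)
    eligible = [
        r for r in results if best_accuracy - r["accuracy"] <= max_accuracy_drop
    ]
    return max(eligible, key=lambda r: r["resistance"])


if __name__ == "__main__":
    # Toy stand-in for training runs: the backdoor "works" once at least
    # 0.5% of the training inputs are compromised (hypothetical number).
    toy_curve = lambda p: 0.9 if p >= 0.005 else 0.1
    print(resistance_point(toy_curve, [0.001, 0.005, 0.01]))

    # Hypothetical search results: accuracy and measured resistance point.
    results = [
        {"name": "baseline", "accuracy": 0.93, "resistance": 0.001},
        {"name": "candidate", "accuracy": 0.92, "resistance": 0.005},
        {"name": "too_lossy", "accuracy": 0.85, "resistance": 0.020},
    ]
    print(pick_config(results)["name"])
```

In this sketch, a higher resistance point means an attacker must compromise more of the training data before any backdoor is learned, so the selection step prefers it as long as main-task accuracy stays within the chosen tolerance.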
