Modern machine learning pipelines leverage large amounts of public data, making it infeasible to guarantee data quality and leaving models open to poisoning and backdoor attacks. Provably bounding model behavior under such attacks remains an open problem. In this work, we address this challenge by developing the first framework providing provable guarantees on the behavior of models trained with potentially manipulated data without modifying the model or learning algorithm. In particular, our framework certifies robustness against untargeted and targeted poisoning, as well as backdoor attacks, for bounded and unbounded manipulations of the training inputs and labels. Our method leverages convex relaxations to over-approximate the set of all possible parameter updates for a given poisoning threat model, allowing us to bound the set of all reachable parameters for any gradient-based learning algorithm. Given this set of parameters, we provide bounds on worst-case behavior, including model performance and backdoor success rate. We demonstrate our approach on multiple real-world datasets from applications including energy consumption, medical imaging, and autonomous driving.
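To make the idea of bounding all reachable parameters concrete, the following is a minimal sketch under strong simplifying assumptions, not the authors' implementation. It assumes a one-parameter model `y_hat = w * x` trained by full-batch gradient descent on squared loss, and a threat model in which at most `n_poisoned` labels are each shifted by at most `eps`. Plain interval arithmetic stands in for the convex relaxations mentioned above, and all names (`certify_label_poisoning`, `train`, `n_poisoned`, `eps`) are hypothetical.

```python
def certify_label_poisoning(xs, ys, n_poisoned, eps, lr=0.1, steps=20, w0=0.0):
    """Return (w_lo, w_hi) enclosing every weight reachable when at most
    n_poisoned labels are each shifted by at most eps (illustrative assumption)."""
    n = len(xs)
    # Shifting label i by delta changes its gradient (w*x_i - y_i)*x_i by
    # -delta*x_i, i.e. by at most eps*|x_i|. The adversary picks the
    # n_poisoned samples with the largest effect.
    shifts = sorted((eps * abs(x) for x in xs), reverse=True)
    worst_shift = sum(shifts[:n_poisoned])

    w_lo = w_hi = w0
    for _ in range(steps):
        # The clean per-sample gradient w*x^2 - y*x is increasing in w
        # (x^2 >= 0), so its extremes over [w_lo, w_hi] sit at the endpoints.
        g_lo = (sum(w_lo * x * x - y * x for x, y in zip(xs, ys)) - worst_shift) / n
        g_hi = (sum(w_hi * x * x - y * x for x, y in zip(xs, ys)) + worst_shift) / n
        # Interval arithmetic for the update w - lr*g: sound but not tight.
        w_lo, w_hi = w_lo - lr * g_hi, w_hi - lr * g_lo
    return w_lo, w_hi


def train(xs, ys, lr=0.1, steps=20, w0=0.0):
    """Ordinary gradient descent on one concrete (possibly poisoned) dataset."""
    n = len(xs)
    w = w0
    for _ in range(steps):
        g = sum((w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * g
    return w


# Toy data generated by y = 2x; one label may be shifted by up to 0.5.
xs = [-1.0, -0.5, 0.5, 1.0, 2.0]
ys = [2.0 * x for x in xs]
w_lo, w_hi = certify_label_poisoning(xs, ys, n_poisoned=1, eps=0.5)
print("certified weight interval:", (w_lo, w_hi))
print("weight from clean training:", train(xs, ys))
```

Any dataset allowed by this threat model yields a trained weight inside `[w_lo, w_hi]`, so worst-case predictions can be bounded by evaluating the model at the interval endpoints. The interval widens with every step because the sketch ignores the dependence between the weight and its gradient, which is one reason tighter relaxations matter in practice.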