Modern machine learning pipelines leverage large amounts of public data, making it infeasible to guarantee data quality and leaving models open to poisoning and backdoor attacks. However, provably bounding model behavior under such attacks remains an open problem. In this work, we address this challenge and develop the first framework providing provable guarantees on the behavior of models trained with potentially manipulated data. In particular, our framework certifies robustness against untargeted and targeted poisoning as well as backdoor attacks for both input and label manipulations. Our method leverages convex relaxations to over-approximate the set of all possible parameter updates for a given poisoning threat model, allowing us to bound the set of all reachable parameters for any gradient-based learning algorithm. Given this set of parameters, we provide bounds on worst-case behavior, including model performance and backdoor success rate. We demonstrate our approach on multiple real-world datasets from applications including energy consumption, medical imaging, and autonomous driving.
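To illustrate the idea of bounding all reachable parameters, the following is a minimal sketch and not the authors' implementation. It uses plain interval arithmetic, one of the simplest convex relaxations, on full-batch gradient descent for least-squares regression. The threat model is an assumption made here for illustration: an adversary may shift at most `k` training labels, each by at most `eps`. The names `certified_gd_bounds` and `train_gd` are hypothetical.

```python
import numpy as np


def train_gd(X, y, lr=0.05, steps=20):
    """Ordinary full-batch gradient descent on 0.5 * mean squared error."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta = theta - lr * X.T @ (X @ theta - y) / len(y)
    return theta


def certified_gd_bounds(X, y, k, eps, lr=0.05, steps=20):
    """Elementwise bounds [theta_lo, theta_hi] containing every parameter
    vector that train_gd can reach when at most k labels are each shifted
    by at most eps (assumed threat model, for illustration only)."""
    n, d = X.shape
    theta_lo, theta_hi = np.zeros(d), np.zeros(d)
    # Split features by sign so interval matrix products stay sound.
    Xp, Xn = np.maximum(X, 0.0), np.minimum(X, 0.0)

    def grad_box(r_lo, r_hi):
        # Per-sample gradient is residual * x; bound it from the residual interval.
        a, b = r_lo[:, None] * X, r_hi[:, None] * X
        return np.minimum(a, b), np.maximum(a, b)

    for _ in range(steps):
        # Prediction interval over the current parameter box.
        pred_lo = Xp @ theta_lo + Xn @ theta_hi
        pred_hi = Xp @ theta_hi + Xn @ theta_lo
        r_lo, r_hi = pred_lo - y, pred_hi - y

        # Gradient boxes if a sample is clean vs. if its label is poisoned.
        gc_lo, gc_hi = grad_box(r_lo, r_hi)
        gp_lo, gp_hi = grad_box(r_lo - eps, r_hi + eps)

        # At most k samples are poisoned: per coordinate, add the k largest
        # widenings (a relaxation, since the worst set may differ by coordinate).
        drop = np.sort(gp_lo - gc_lo, axis=0)[:k].sum(axis=0)
        rise = np.sort(gp_hi - gc_hi, axis=0)[n - k:].sum(axis=0)
        g_lo = (gc_lo.sum(axis=0) + drop) / n
        g_hi = (gc_hi.sum(axis=0) + rise) / n

        # Interval version of theta <- theta - lr * grad.
        theta_lo, theta_hi = theta_lo - lr * g_hi, theta_hi - lr * g_lo
    return theta_lo, theta_hi


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)
    lo, hi = certified_gd_bounds(X, y, k=3, eps=0.5)
    print("lower bounds:", lo)
    print("upper bounds:", hi)
```

Because the sketch ignores the dependence between the parameters and their gradients, the box widens with every step; it is sound but loose. Any worst-case quantity, such as a prediction on a test input, can then be bounded by evaluating it over the returned box.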