Adversarial attacks are a major concern in security-centered applications, where malicious actors continuously try to mislead Machine Learning (ML) models into classifying fraudulent activity as legitimate while system maintainers try to stop them. Adversarially training ML models to be robust against such attacks can prevent business losses and reduce the workload of system maintainers. In such applications, data is often tabular, and the space that attackers can manipulate is mapped by complex feature engineering transformations, which provide useful signals for model training, into a space that attackers cannot access. We therefore propose a new form of adversarial training in which attacks are propagated between the two spaces in the training loop. We then test this method empirically on a real-world dataset in the domain of credit card fraud detection. We show that our method can prevent performance drops of about 30% under moderate attacks and is essential under very aggressive attacks, at the cost of a performance loss smaller than 7% when there are no attacks.
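The idea of propagating attacks between the two spaces inside the training loop can be illustrated with a toy sketch. This is not the method evaluated in the paper; it is a minimal example under stated assumptions. The raw fields (`amount`, `cust_mean`), the hypothetical `engineer_features` transformation, the scaling-search attack in `attack_raw`, and the logistic regression model are all illustrative choices that do not come from the source. At each epoch, fraudulent rows are perturbed in the raw space, pushed through feature engineering, and only then used for the gradient step.

```python
import numpy as np

def engineer_features(amount, cust_mean):
    # Hypothetical feature engineering: raw, attacker-visible fields -> model inputs.
    return np.column_stack([np.log1p(amount), amount / cust_mean])

def predict(w, b, X):
    # Logistic regression score (probability of fraud).
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def attack_raw(amount, cust_mean, w, b, scales=(0.5, 0.75, 1.0)):
    # Assumed attack: the attacker may only shrink the transaction amount.
    # Each candidate is evaluated in feature space, but chosen in raw space.
    candidates = np.stack([amount * s for s in scales])  # (n_scales, n)
    scores = np.stack(
        [predict(w, b, engineer_features(c, cust_mean)) for c in candidates]
    )
    best = scores.argmin(axis=0)  # lowest fraud score per row
    return candidates[best, np.arange(amount.size)]

def train(amount, cust_mean, y, epochs=200, lr=0.05, adversarial=True):
    w, b = np.zeros(2), 0.0
    fraud = y == 1
    for _ in range(epochs):
        a = amount.copy()
        if adversarial:
            # Attack in raw space against the current model ...
            a[fraud] = attack_raw(amount[fraud], cust_mean[fraud], w, b)
        # ... then propagate to feature space before the update.
        X = engineer_features(a, cust_mean)
        grad = predict(w, b, X) - y
        w -= lr * (X.T @ grad) / y.size
        b -= lr * grad.mean()
    return w, b

def make_toy_data(n=400, seed=0):
    # Synthetic data only: fraud spends 2-4x the customer's mean amount.
    rng = np.random.default_rng(seed)
    cust_mean = rng.uniform(20.0, 100.0, n)
    y = (rng.random(n) < 0.3).astype(float)
    ratio = np.where(y == 1, rng.uniform(2.0, 4.0, n), rng.uniform(0.5, 1.5, n))
    return cust_mean * ratio, cust_mean, y

amount, cust_mean, y = make_toy_data()
w, b = train(amount, cust_mean, y)
fraud = y == 1
attacked = attack_raw(amount[fraud], cust_mean[fraud], w, b)
clean_X = engineer_features(amount[fraud], cust_mean[fraud])
attacked_X = engineer_features(attacked, cust_mean[fraud])
print("mean fraud score, clean:   ", predict(w, b, clean_X).mean())
print("mean fraud score, attacked:", predict(w, b, attacked_X).mean())
```

Calling `train(..., adversarial=False)` gives a baseline trained only on unperturbed data, which can be compared against the adversarially trained model under the same `attack_raw` perturbation. The numbers from this synthetic setup say nothing about the results reported above.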