Adversarial attacks are a major obstacle to the reliable use of machine learning models. A powerful type of adversarial attack is the patch-based attack, in which the adversarial perturbations modify localized patches or specific areas within an image to deceive the trained machine learning model. In this paper, we introduce Outlier Detection and Dimension Reduction (ODDR), a holistic defense mechanism designed to mitigate patch-based adversarial attacks. Our approach posits that input features corresponding to adversarial patches, whether naturalistic or otherwise, deviate from the inherent distribution of the rest of the image sample and can therefore be identified as outliers or anomalies. ODDR employs a three-stage pipeline of Fragmentation, Segregation, and Neutralization, providing a model-agnostic solution applicable to both image classification and object detection tasks. The Fragmentation stage splits each sample into chunks for the subsequent Segregation stage, in which outlier detection techniques identify and segregate the anomalous features associated with adversarial perturbations. The Neutralization stage then applies dimension reduction methods to the outliers to mitigate the impact of the adversarial perturbations without sacrificing information pertinent to the machine learning task. Extensive testing on benchmark datasets and state-of-the-art adversarial patches demonstrates the effectiveness of ODDR. Robust accuracies fall within a small range of clean accuracies (1%-3% for classification and 3%-5% for object detection), with only a marginal compromise of 1%-2% in performance on clean samples, thereby significantly outperforming other defenses.
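The three stages can be illustrated with a minimal sketch. The abstract names the stages but does not specify the algorithms, so everything concrete below is an assumption made for illustration: non-overlapping square tiles for Fragmentation, a robust z-score over per-tile mean and standard deviation for Segregation, and a truncated-SVD low-rank approximation for Neutralization. The function names (`fragment`, `segregate`, `neutralize`, `oddr_sketch`) and the parameters `size`, `threshold`, and `rank` are hypothetical and are not taken from the paper.

```python
import numpy as np


def fragment(image, size):
    """Fragmentation (assumed form): split a 2-D grayscale image into
    non-overlapping size x size tiles and return them with their origins."""
    h, w = image.shape
    coords = [(r, c)
              for r in range(0, h - size + 1, size)
              for c in range(0, w - size + 1, size)]
    tiles = np.stack([image[r:r + size, c:c + size] for r, c in coords])
    return tiles, coords


def segregate(tiles, threshold=3.5):
    """Segregation (stand-in detector): flag tiles whose mean or standard
    deviation is an outlier under a median/MAD robust z-score."""
    feats = np.stack([tiles.mean(axis=(1, 2)), tiles.std(axis=(1, 2))], axis=1)
    med = np.median(feats, axis=0)
    mad = np.median(np.abs(feats - med), axis=0) + 1e-8  # avoid divide-by-zero
    z = 0.6745 * np.abs(feats - med) / mad
    return np.any(z > threshold, axis=1)


def neutralize(tile, rank=1):
    """Neutralization (stand-in reducer): replace a tile with its
    rank-`rank` truncated-SVD approximation."""
    u, s, vt = np.linalg.svd(tile, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


def oddr_sketch(image, size=8, threshold=3.5, rank=1):
    """Run the three assumed stages and return the cleaned image and flags."""
    out = image.astype(float).copy()
    tiles, coords = fragment(out, size)
    flags = segregate(tiles, threshold)
    for (r, c), tile, flagged in zip(coords, tiles, flags):
        if flagged:
            out[r:r + size, c:c + size] = neutralize(tile, rank)
    return out, flags
```

Calling `oddr_sketch(image)` on a 2-D array returns the processed image together with a boolean flag per tile. Only flagged tiles are modified, which mirrors the abstract's point that neutralization is applied to the outliers rather than to the whole sample.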