The performance of machine learning models depends on the quality of the underlying data. Malicious actors can attack a model by poisoning its training data. Current detectors are tied to specific data types, models, or attacks, and therefore have limited applicability in real-world scenarios. This paper presents a novel, fully agnostic framework, DIVA (Detecting InVisible Attacks), that detects attacks by analyzing only the potentially poisoned dataset. DIVA is based on the idea that poisoning attacks can be detected by comparing the classifier's accuracy on poisoned and clean data. Since the accuracy on clean data is otherwise unknown, DIVA pre-trains a meta-learner on complexity measures to estimate the accuracy on a hypothetical clean dataset. The framework applies to generic poisoning attacks; in this paper, we evaluate DIVA on label-flipping attacks.
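The detection idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the toy one-dimensional data, the nearest-class-mean classifier, the two complexity measures (Fisher's discriminant ratio and leave-one-out 1-NN error), the k-nearest-neighbour meta-learner, the threshold of 0.1, and all function names are assumptions chosen for brevity.

```python
import math
import random
from statistics import mean, pvariance


def flip_labels(y, rate, rng):
    """Label-flipping attack: invert a fraction `rate` of the binary labels."""
    chosen = set(rng.sample(range(len(y)), round(rate * len(y))))
    return [1 - c if i in chosen else c for i, c in enumerate(y)]


def make_dataset(n, separation, rng):
    """Toy 1-D data (an assumption): class c is drawn from N(c * separation, 1)."""
    y = [i % 2 for i in range(n)]
    return [rng.gauss(c * separation, 1.0) for c in y], y


def accuracy(x, y):
    """Resubstitution accuracy of a nearest-class-mean classifier fit on (x, y)."""
    m0, m1 = [mean(v for v, c in zip(x, y) if c == k) for k in (0, 1)]
    return mean(int((abs(v - m1) < abs(v - m0)) == (c == 1)) for v, c in zip(x, y))


def complexity(x, y):
    """Two complexity measures: Fisher's discriminant ratio and leave-one-out 1-NN error."""
    a = [v for v, c in zip(x, y) if c == 0]
    b = [v for v, c in zip(x, y) if c == 1]
    fisher = (mean(a) - mean(b)) ** 2 / (pvariance(a) + pvariance(b))
    pts = sorted(zip(x, y))  # in 1-D the nearest neighbour is adjacent after sorting
    errors = 0
    for i, (v, c) in enumerate(pts):
        neighbours = pts[max(i - 1, 0):i] + pts[i + 1:i + 2]
        nearest = min(neighbours, key=lambda p: abs(p[0] - v))
        errors += nearest[1] != c
    return fisher, errors / len(pts)


def pretrain_meta_learner(rng, n_datasets=60, n=200):
    """Meta-data rows: (complexity of a possibly poisoned dataset, accuracy on its clean version)."""
    meta = []
    for _ in range(n_datasets):
        x, y = make_dataset(n, rng.uniform(1.0, 4.0), rng)
        y_seen = flip_labels(y, rng.choice([0.0, 0.1, 0.2, 0.3]), rng)
        meta.append((complexity(x, y_seen), accuracy(x, y)))
    return meta


def estimate_clean_accuracy(meta, features, k=5):
    """k-NN regression on unscaled complexity features (a stand-in meta-learner)."""
    nearest = sorted(meta, key=lambda row: math.dist(row[0], features))[:k]
    return mean(acc for _, acc in nearest)


def detect(meta, x, y, threshold=0.1):
    """Flag the dataset if observed accuracy falls well below the estimated clean accuracy."""
    gap = estimate_clean_accuracy(meta, complexity(x, y)) - accuracy(x, y)
    return gap > threshold, gap


rng = random.Random(0)
meta = pretrain_meta_learner(rng)
x, y = make_dataset(200, 3.0, rng)
for rate in (0.0, 0.3):
    flagged, gap = detect(meta, x, flip_labels(y, rate, rng))
    print(f"flip rate {rate:.1f}: gap={gap:+.3f}, flagged={flagged}")
```

The detector only ever sees the possibly poisoned dataset: the meta-learner supplies the clean-accuracy estimate, and the decision reduces to a thresholded gap. How well this toy version separates clean from poisoned data depends heavily on the chosen measures and threshold, so it should be read as an illustration of the data flow rather than a tuned detector.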