The massive deployment of Machine Learning (ML) models has been accompanied by the emergence of several attacks that threaten their trustworthiness and raise ethical and societal concerns, such as invasion of privacy, discrimination risks, and lack of accountability. Model hijacking is one of these attacks, in which the adversary aims to repurpose a victim model to execute a task different from its original one. Model hijacking creates accountability and security risks, since a hijacked model's owner can be framed for having their model offer illegal or unethical services. Prior state-of-the-art work treats model hijacking as a training-time attack, whereby the adversary requires access to the ML model's training in order to execute the attack. In this paper, we consider a stronger threat model in which the attacker has no access to the training phase of the victim model. Our intuition is that ML models, which are typically over-parameterized, may (unintentionally) learn more than the task for which they are trained. We propose a simple inference-time model hijacking approach, named SnatchML, which classifies unknown input samples by measuring their distance in the victim model's latent space to previously known samples associated with the hijacking task's classes. Using SnatchML, we empirically show that benign pre-trained models can execute tasks that are semantically related to their initial task. Surprisingly, this can hold even for hijacking tasks unrelated to the original task. We also explore different methods to mitigate this risk. We first propose a novel approach, which we call meta-unlearning, designed to help the model unlearn a potentially malicious task while training on the original task's dataset. We also provide insights on over-parameterization as one possible inherent factor that makes model hijacking easier, and accordingly propose a compression-based countermeasure against this attack.
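The inference-time hijacking idea described above can be illustrated with a minimal sketch: embed a few known samples of each hijacking-task class through the victim model's (frozen) feature extractor, then label an unknown input by its nearest class prototype in that latent space. The `latent` function below is a hypothetical stand-in for the victim model's penultimate-layer output; the distance measure (Euclidean, here) and the prototype averaging are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def latent(x):
    # Hypothetical stand-in for the victim model's latent representation;
    # in practice this would be the frozen pre-trained encoder's output.
    w = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])  # fixed toy projection
    return w @ x

def build_prototypes(samples_by_class):
    """Average the latent vectors of known samples for each hijacking-task class."""
    return {c: np.mean([latent(s) for s in xs], axis=0)
            for c, xs in samples_by_class.items()}

def snatch_classify(x, prototypes):
    """Assign an unknown input to the hijacking class whose prototype is
    nearest in the victim model's latent space (Euclidean distance)."""
    z = latent(x)
    return min(prototypes, key=lambda c: np.linalg.norm(z - prototypes[c]))

# Toy usage: two hijacking classes, each seeded with known samples.
known = {"A": [np.array([1.0, 0.0]), np.array([0.9, 0.1])],
         "B": [np.array([0.0, 1.0])]}
protos = build_prototypes(known)
print(snatch_classify(np.array([0.95, 0.05]), protos))  # nearest prototype: "A"
```

No attacker training is involved: the victim model is only queried at inference time, which is precisely what makes this threat model stronger than training-time hijacking.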