Trojan attacks embed malicious perturbations in input data to cause neural networks to misbehave. However, the impact of a Trojan attack is reduced when the model is fine-tuned, a process that transfers knowledge from a large-scale pretrained model, such as one for visual question answering (VQA), to a target model. Replacing and fine-tuning multiple layers of the pretrained model can mitigate the effects of the attack. This work addresses three challenges for Trojan attacks in this setting: sample efficiency, stealthiness and variation, and robustness to model fine-tuning. To address these challenges, we propose an instance-level Trojan attack that generates diverse Trojans across input samples and modalities. Adversarial learning establishes a correlation between a specified perturbation layer and the misbehavior of the fine-tuned model. We conducted extensive experiments on the VQA-v2 dataset using a range of metrics. The results show that the proposed method adapts effectively to a fine-tuned model with minimal samples. Specifically, a model with a single fine-tuned layer can be compromised with a single shot of adversarial samples, while a model with more fine-tuned layers can be compromised with only a few shots.