As large language models achieve breakthroughs in natural language processing (NLP) tasks, multimodal techniques have become extremely popular. However, multimodal NLP models have been shown to be vulnerable to adversarial attacks, in which a perturbation to the input can dramatically change a model's output. While several defense techniques have been proposed for both computer vision and NLP models, the robustness of multimodal models has not been fully explored. In this paper, we study the adversarial robustness gained by modifying the loss function of pre-trained multimodal models so that it restricts the top-K softmax outputs. Based on our evaluation and scoring, the experiments show that, after fine-tuning, the adversarial robustness of pre-trained models against popular attacks can be significantly improved. Further research is needed on the output diversity, generalization, and robustness-performance trade-off of this kind of loss function. Our code will be made available after this paper is accepted.
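The abstract does not spell out how the loss restricts the top-K softmax outputs, so the following is only a minimal sketch of one possible reading, not the paper's actual loss: a cross-entropy whose softmax normalization runs over the K largest logits only. The function names `topk_restricted_cross_entropy` and `full_cross_entropy`, and the choice to always keep the target class in the retained set, are assumptions made for illustration.

```python
import math


def topk_restricted_cross_entropy(logits, target, k):
    """Cross-entropy with the softmax normalised over only the k largest logits.

    Hypothetical illustration (not the paper's definition): the target class
    is always kept in the retained set so that the loss stays finite.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of classes")
    # Indices of the k largest logits.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = order[:k]
    if target not in kept:
        # Assumption: swap the weakest retained class for the target.
        kept[-1] = target
    # Numerically stable log-sum-exp over the retained logits only.
    m = max(logits[i] for i in kept)
    log_z = m + math.log(sum(math.exp(logits[i] - m) for i in kept))
    return log_z - logits[target]


def full_cross_entropy(logits, target):
    """Standard softmax cross-entropy, i.e. the k = number-of-classes case."""
    return topk_restricted_cross_entropy(logits, target, len(logits))


if __name__ == "__main__":
    example_logits = [2.0, 1.0, 0.1, -1.0]
    print("full:      ", full_cross_entropy(example_logits, 0))
    print("top-2 only:", topk_restricted_cross_entropy(example_logits, 0, 2))
```

Under this reading, the restricted loss is never larger than the full cross-entropy, because the normalizer sums over a subset of the classes. Whether this matches the loss evaluated in the paper cannot be determined from the abstract alone.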