Large Vision-Language Models (LVLMs), such as GPT-4o and LLaVA, have recently witnessed remarkable advancements and are increasingly deployed in real-world applications. However, inheriting the sensitivity of visual neural networks, LVLMs remain vulnerable to adversarial attacks, which can induce erroneous or malicious outputs. While existing efforts employ adversarial fine-tuning to enhance robustness, they often suffer performance degradation on clean inputs. In this paper, we propose AdPO, a novel adversarial defense strategy for LVLMs based on preference optimization. For the first time, we reframe adversarial training as a preference optimization problem: the model is trained to prefer generating normal outputs on clean inputs while rejecting potentially misleading outputs on adversarial examples. Notably, AdPO achieves this by modifying only the image encoder, e.g., CLIP ViT, yielding superior clean and adversarial performance across a variety of downstream tasks. Since training involves large language models (LLMs), the computational cost increases significantly. We validate that training on smaller LVLMs and subsequently transferring to larger models achieves competitive performance while maintaining efficiency comparable to baseline methods. Our comprehensive experiments confirm the effectiveness of the proposed AdPO, which provides a novel perspective for future adversarial defense research.
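To make the preference-optimization framing concrete, the sketch below shows a generic DPO-style loss over a "chosen" response (the normal output on a clean input) and a "rejected" response (the misleading output on the adversarial example). This is a minimal illustration under assumed notation: the function name, the `beta` value, and the scalar log-probability inputs are all hypothetical, and AdPO's actual objective may differ in detail.

```python
import math

def dpo_preference_loss(logp_chosen, logp_rejected,
                        ref_logp_chosen, ref_logp_rejected,
                        beta=0.1):
    """Generic DPO-style loss: -log sigmoid of the scaled log-ratio margin
    between the chosen (clean-input) and rejected (adversarial) responses,
    each measured relative to a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically this is -log(sigmoid(margin)); smaller loss means the
    # policy already prefers the clean-input response over the misled one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy assigns equal relative likelihood to both responses, the margin is zero and the loss is log 2; increasing the policy's preference for the chosen response drives the loss toward zero, which is the intuition behind steering the image encoder away from adversarially induced outputs.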