Infrared vision-language models (IR-VLMs) have emerged as a promising paradigm for multimodal perception in low-visibility environments, yet their robustness to adversarial attacks remains largely unexplored. Existing adversarial patch methods are designed mainly for RGB-based models in closed-set settings and do not readily transfer to the open-ended semantic understanding and physical deployment requirements of IR-VLMs. To bridge this gap, we propose the Universal Curved-Grid Patch (UCGP), a universal physical adversarial patch framework for IR-VLMs. UCGP combines a Curved-Grid Mesh (CGM) parameterization, which yields continuous, low-frequency, and physically deployable patches, with a unified representation-driven objective that promotes subspace departure, topology disruption, and stealth. To improve robustness under real-world deployment and domain shift, we further incorporate Meta Differential Evolution and thin-plate spline (TPS) deformation modeling augmented with Expectation over Transformation (EOT). Rather than manipulating labels or prompts, UCGP directly disrupts the visual representation space, thereby weakening cross-modal semantic alignment. Extensive experiments demonstrate that UCGP consistently compromises semantic understanding across diverse IR-VLM architectures while maintaining cross-model transferability, cross-dataset generalization, real-world physical effectiveness, and robustness against defenses. These findings reveal a previously overlooked vulnerability in current infrared multimodal systems.
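To make the ingredients named above concrete, the following is a minimal toy sketch and not the authors' implementation. It rests on several loudly stated assumptions: the CGM parameterization is replaced by a small control grid that is bilinearly upsampled (which still gives a smooth, low-frequency patch); TPS deformation is replaced by random placement and intensity gain as the EOT transformations; Meta Differential Evolution is replaced by plain DE/rand/1/bin; and the visual encoder is a fixed random projection. All function and variable names (`upsample_grid`, `encode`, `eot_loss`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
IMG, PATCH, GRID = 32, 8, 4          # image size, patch size, control-grid size

# Hypothetical stand-in for a visual encoder: a fixed random projection.
W = rng.standard_normal((16, IMG * IMG))
def encode(img):
    return W @ img.ravel()

def upsample_grid(ctrl, size):
    """Bilinearly interpolate a small control grid into a smooth size x size patch."""
    g = ctrl.shape[0]
    pos = np.linspace(0, g - 1, size)
    i0 = np.floor(pos).astype(int).clip(0, g - 2)
    t = pos - i0
    rows = ctrl[i0] * (1 - t)[:, None] + ctrl[i0 + 1] * t[:, None]
    return rows[:, i0] * (1 - t)[None, :] + rows[:, i0 + 1] * t[None, :]

def eot_loss(params, img, n_samples=8, seed=123):
    """Mean cosine similarity between clean and patched features over random
    placements and gains. A fixed seed keeps the loss deterministic, so all
    candidates are scored on the same set of transformations."""
    local = np.random.default_rng(seed)
    patch = upsample_grid(params.reshape(GRID, GRID), PATCH)
    clean = encode(img)
    total = 0.0
    for _ in range(n_samples):
        top, left = local.integers(0, IMG - PATCH + 1, size=2)
        gain = local.uniform(0.8, 1.2)
        adv = img.copy()
        adv[top:top + PATCH, left:left + PATCH] = np.clip(patch * gain, 0.0, 1.0)
        feat = encode(adv)
        total += clean @ feat / (np.linalg.norm(clean) * np.linalg.norm(feat))
    return total / n_samples          # lower means features moved further away

# Plain differential evolution (DE/rand/1/bin), not the Meta variant.
img = rng.random((IMG, IMG))          # synthetic stand-in for an infrared frame
P, DIM, F, CR = 12, GRID * GRID, 0.6, 0.9
pop = rng.random((P, DIM))
fitness = np.array([eot_loss(p, img) for p in pop])
init_best = fitness.min()

for _ in range(20):
    for i in range(P):
        others = [j for j in range(P) if j != i]
        a, b, c = pop[rng.choice(others, size=3, replace=False)]
        mutant = np.clip(a + F * (b - c), 0.0, 1.0)
        mask = rng.random(DIM) < CR
        mask[rng.integers(DIM)] = True        # guarantee at least one mutated gene
        trial = np.where(mask, mutant, pop[i])
        trial_loss = eot_loss(trial, img)
        if trial_loss <= fitness[i]:          # greedy selection
            pop[i], fitness[i] = trial, trial_loss

final_best = fitness.min()
best_patch = upsample_grid(pop[fitness.argmin()].reshape(GRID, GRID), PATCH)
print(f"best EOT cosine similarity: {init_best:.4f} -> {final_best:.4f}")
```

The objective here is label-free and prompt-free: it only measures how far the patched image's features drift from the clean ones, which mirrors the representation-level attack described above in spirit. A gradient-free optimizer such as differential evolution suits this setting because it needs only loss evaluations, not gradients through the encoder.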