Targeted adversarial attacks on closed-source multimodal large language models (MLLMs) have been increasingly explored under black-box transfer, yet prior methods are predominantly sample-specific and offer limited reusability across inputs. We instead study a more stringent setting, Universal Targeted Transferable Adversarial Attacks (UTTAA), where a single perturbation must consistently steer arbitrary inputs toward a specified target across unknown commercial MLLMs. Naively adapting existing sample-wise attacks to this universal setting faces three core difficulties: (i) target supervision becomes high-variance due to target-crop randomness, (ii) token-wise matching is unreliable because universality suppresses the image-specific cues that would otherwise anchor alignment, and (iii) few-source per-target adaptation is highly initialization-sensitive, which can degrade the attainable performance. In this work, we propose MCRMO-Attack, which stabilizes supervision via Multi-Crop Aggregation with an Attention-Guided Crop, improves token-level reliability through alignability-gated Token Routing, and meta-learns a cross-target perturbation prior that yields stronger per-target solutions. Across commercial MLLMs, we improve the unseen-image attack success rate over the strongest universal baseline by +23.7\% on GPT-4o and +19.9\% on Gemini-2.0.
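To make the variance-reduction idea behind Multi-Crop Aggregation concrete, the following is a minimal sketch, not the paper's implementation. The encoder stand-in `embed`, the crop counts, and the way the attention-guided crop is centered on the attention peak are all illustrative assumptions; in practice the features would come from a surrogate vision encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(crop):
    # Stand-in for a surrogate vision encoder: mean-pool pixels
    # into a per-channel feature vector (shape (3,)).
    return crop.mean(axis=(0, 1))

def multi_crop_target(image, n_crops=8, size=64, attn=None):
    """Aggregate features over several random crops of the target image
    (plus one attention-guided crop) to reduce the variance of the
    target supervision signal caused by crop randomness."""
    h, w, _ = image.shape
    feats = []
    for _ in range(n_crops):
        y = rng.integers(0, h - size + 1)
        x = rng.integers(0, w - size + 1)
        feats.append(embed(image[y:y + size, x:x + size]))
    if attn is not None:
        # Attention-guided crop (illustrative): center the crop on the
        # location with the highest attention score.
        y, x = np.unravel_index(attn.argmax(), attn.shape)
        y = int(np.clip(y - size // 2, 0, h - size))
        x = int(np.clip(x - size // 2, 0, w - size))
        feats.append(embed(image[y:y + size, x:x + size]))
    # Averaging over crops yields a lower-variance target feature than
    # any single random crop.
    return np.mean(feats, axis=0)
```

A universal perturbation can then be optimized to pull source-image features toward this aggregated target feature, rather than toward a noisy single-crop feature.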