While multimodal large language models (MLLMs) have shown strong visual reasoning abilities, serving a large model for every query is computationally expensive. MLLM cascades mitigate this cost by first querying a weak but cheaper model and deferring to a strong model only when the weak model's confidence in its output is low. However, since the weak model's confidence directly controls compute allocation, these systems expose a new attack surface: an adversary can manipulate that confidence so that their queries are consistently deferred to the strong model. Motivated by this vulnerability, we introduce the Forced Deferral Attack (FDA), an adversarial image attack that lowers the weak model's confidence and causes cascades to route queries to the strong model. FDA learns a universal border trigger by optimizing a temperature-flattened objective, which pushes the weak model's token distribution on triggered inputs toward less concentrated targets constructed from its clean responses. Across datasets, model families, and deferral metrics, FDA consistently increases strong-model routing and outperforms image-perturbation and prompt-injection baselines. These results show that MLLM cascades are vulnerable to attacks that manipulate compute allocation, forcing unintended strong-model usage without directly targeting answer correctness.
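To make the mechanism concrete, the following is a minimal sketch of a confidence-based deferral rule and of temperature-flattened targets. It is not the authors' implementation: the deferral metric (mean max-token probability), the threshold of 0.8, the temperature of 3.0, the cross-entropy loss form, and all function names are assumptions chosen for illustration, and the toy logits stand in for a real weak model's outputs.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one token position."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_confidence(token_logits):
    """Assumed deferral metric: mean max-token probability over the response."""
    peaks = [max(softmax(logits)) for logits in token_logits]
    return sum(peaks) / len(peaks)


def route(token_logits, threshold=0.8):
    """Keep the weak model's answer if it is confident, otherwise defer."""
    return "weak" if sequence_confidence(token_logits) >= threshold else "strong"


def flattened_targets(clean_logits, temperature=3.0):
    """Less concentrated targets built from the weak model's clean logits."""
    return [softmax(logits, temperature) for logits in clean_logits]


def cross_entropy(target_probs, logits):
    """Assumed loss form: cross-entropy between a target and the model's distribution."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs))


# Toy per-token logits of the weak model on a clean input (hypothetical values).
clean_logits = [[6.0, 1.0, 0.5], [5.0, 0.0, -1.0]]
targets = flattened_targets(clean_logits)

# Stand-in for a successfully triggered input: logits whose softmax matches the targets.
triggered_logits = [[x / 3.0 for x in logits] for logits in clean_logits]

print(route(clean_logits), route(triggered_logits))
```

In this sketch, a trigger that drives the weak model's distribution toward the flattened targets lowers the confidence score below the threshold, so the cascade defers to the strong model even though the most likely token at each position is unchanged.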