Graphical user interface (GUI) agents built on multimodal large language models (MLLMs) have recently demonstrated strong decision-making abilities in screen-based interaction tasks. However, they remain highly vulnerable to pop-up-based environmental injection attacks, in which malicious visual elements divert the model's attention and lead to unsafe or incorrect actions. Existing defense methods either require costly retraining or perform poorly under inductive interference. In this work, we systematically study how such attacks alter the attention behavior of GUI agents and uncover a layer-wise attention divergence pattern between correct and incorrect outputs. Based on this insight, we propose \textbf{LaSM}, a \textit{Layer-wise Scaling Mechanism} that selectively amplifies attention and MLP modules in critical layers. LaSM improves the alignment between model saliency and task-relevant regions without additional training. Extensive experiments across multiple datasets demonstrate that our method significantly improves the defense success rate and exhibits strong robustness, while having negligible impact on the model's general capabilities. Our findings reveal that attention misalignment is a core vulnerability in MLLM agents and that it can be effectively addressed through selective layer-wise modulation. Our code is available at https://github.com/YANGTUOMAO/LaSM.
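To make the idea of selective layer-wise amplification concrete, the following is a minimal toy sketch, not the authors' implementation (the linked repository is the authoritative source). It assumes one possible reading of the mechanism: in a chosen set of "critical" layers, the residual contributions of the attention and MLP sub-modules are multiplied by a factor `alpha`, while all other layers are left untouched. The names `ToyLayer`, `make_toy_layers`, `forward_with_scaling`, `critical_layers`, and `alpha` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Sequence

Vector = List[float]


@dataclass
class ToyLayer:
    """Stand-in for one transformer block with residual attention and MLP branches."""
    attn: Callable[[Vector], Vector]
    mlp: Callable[[Vector], Vector]


def make_toy_layers(n: int) -> List[ToyLayer]:
    """Build n identical toy layers (assumption: fixed linear maps, no real attention)."""
    return [
        ToyLayer(
            attn=lambda h: [0.5 * v for v in h],
            mlp=lambda h: [0.25 * v for v in h],
        )
        for _ in range(n)
    ]


def forward_with_scaling(
    layers: Sequence[ToyLayer],
    x: Vector,
    critical_layers: FrozenSet[int] = frozenset(),
    alpha: float = 1.0,
) -> Vector:
    """Run the stack, scaling attention and MLP outputs by alpha in critical layers only."""
    h = list(x)
    for idx, layer in enumerate(layers):
        # Layers outside the critical set keep their original contribution.
        scale = alpha if idx in critical_layers else 1.0
        attn_out = layer.attn(h)
        h = [hi + scale * ai for hi, ai in zip(h, attn_out)]
        mlp_out = layer.mlp(h)
        h = [hi + scale * mi for hi, mi in zip(h, mlp_out)]
    return h


layers = make_toy_layers(3)
baseline = forward_with_scaling(layers, [1.0])
scaled = forward_with_scaling(layers, [1.0], critical_layers=frozenset({1}), alpha=2.0)
print(baseline, scaled)
```

Because the scaling is applied only at inference time and only to selected layers, no parameters are trained in this sketch, which mirrors the training-free property claimed in the abstract. How the critical layers and the scaling factor are actually chosen is not specified here and is left to the paper and its code.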