Graphical user interface (GUI) agents built on multimodal large language models (MLLMs) have recently demonstrated strong decision-making abilities in screen-based interaction tasks. However, they remain highly vulnerable to pop-up-based environmental injection attacks, in which malicious visual elements divert model attention and lead to unsafe or incorrect actions. Existing defense methods either require costly retraining or perform poorly under inductive interference. In this work, we systematically study how such attacks alter the attention behavior of GUI agents and uncover a layer-wise attention divergence pattern between correct and incorrect outputs. Based on this insight, we propose \textbf{LaSM}, a \textit{Layer-wise Scaling Mechanism} that selectively amplifies the outputs of attention and MLP modules in critical layers. LaSM improves the alignment between model saliency and task-relevant regions without additional training. Extensive experiments across multiple datasets demonstrate that our method significantly improves the defense success rate and exhibits strong robustness, while having negligible impact on the model's general capabilities. Our findings reveal that attention misalignment is a core vulnerability in MLLM agents and can be effectively mitigated through selective layer-wise modulation. Our code is available at https://github.com/YANGTUOMAO/LaSM.
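As a minimal sketch of the layer-wise scaling idea, and not the paper's exact implementation, the snippet below uses PyTorch forward hooks to multiply the outputs of the self-attention and MLP sub-modules in a chosen set of critical layers by a constant factor. The toy model, the layer indices, and the factor \texttt{alpha} are illustrative assumptions.
\begin{verbatim}
# Hedged sketch: scale attention/MLP sub-module outputs in "critical"
# layers via forward hooks. Layer indices and alpha are illustrative.
import torch
import torch.nn as nn

def scaling_hook(alpha):
    """Forward hook that multiplies a sub-module's output by alpha."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):  # e.g. (hidden_states, attn_weights)
            return (output[0] * alpha,) + output[1:]
        return output * alpha
    return hook

def apply_layerwise_scaling(layers, critical_layers, alpha=1.5):
    """Attach scaling hooks to attention and MLP modules of critical layers.

    Assumes each layer exposes .self_attn and .mlp attributes (LLaMA-style
    naming); adapt the attribute paths to the target MLLM.
    """
    handles = []
    for i in critical_layers:
        handles.append(
            layers[i].self_attn.register_forward_hook(scaling_hook(alpha)))
        handles.append(
            layers[i].mlp.register_forward_hook(scaling_hook(alpha)))
    return handles  # call h.remove() on each handle to undo the edit

class ToyLayer(nn.Module):
    """Pre-norm transformer block used only to demonstrate the hooks."""
    def __init__(self, d=32):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, num_heads=4,
                                               batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(),
                                 nn.Linear(4 * d, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.self_attn(h, h, h, need_weights=False)
        x = x + a
        return x + self.mlp(self.ln2(x))

class ToyModel(nn.Module):
    def __init__(self, d=32, n_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(ToyLayer(d) for _ in range(n_layers))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

model = ToyModel()
handles = apply_layerwise_scaling(model.layers, critical_layers=[3, 4])
out = model(torch.randn(1, 10, 32))  # forward pass with layers 3, 4 scaled
for h in handles:
    h.remove()  # restore the unmodified model
\end{verbatim}
Hooks are used here so the intervention is training-free and reversible, in line with the abstract's claim that no retraining is required; the selection of critical layers would in practice be driven by the layer-wise attention divergence analysis described above.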