Zero-Permission Manipulation: Can We Trust Large Multimodal Model Powered GUI Agents?

Large multimodal model powered GUI agents are emerging as high-privilege operators on mobile platforms, entrusted with perceiving screen content and injecting inputs. However, their design rests on an implicit assumption of Visual Atomicity: that the UI state remains invariant between observation and action. We demonstrate that this assumption does not hold on Android, which creates a critical attack surface. We present Action Rebinding, a novel attack that allows a seemingly benign app with zero dangerous permissions to rebind an agent's execution. By exploiting the observation-to-action gap inherent in the agent's reasoning pipeline, the attacker triggers foreground transitions so that the agent's planned action lands on the target app instead. We weaponize the agent's task-recovery logic and Android's UI state preservation to orchestrate programmable, multi-step attack chains. Furthermore, we introduce an Intent Alignment Strategy (IAS) that manipulates the agent's reasoning process so that it rationalizes unexpected UI states, enabling it to pass verification gates (e.g., confirmation dialogs) at which it would otherwise refuse to proceed. We evaluate Action Rebinding attacks on six widely used Android GUI agents across 15 tasks. Our results demonstrate a 100% success rate for atomic action rebinding and the ability to reliably orchestrate multi-step attack chains. With IAS, the success rate in bypassing verification gates rises from 0% to as high as 100%. Notably, the attacker application requires no sensitive permissions and contains no privileged API calls, achieving a 0% detection rate across malware scanners (e.g., VirusTotal). Our findings reveal a fundamental architectural flaw in current agent-OS integration and provide critical insights for the secure design of future agent systems. To access experimental logs and demonstration videos, please contact [email protected].
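The observation-to-action gap described above is a time-of-check to time-of-use problem, and a toy model may make it concrete. The sketch below is purely illustrative and assumes nothing about any real agent or Android API: `Screen`, `PlannedAction`, `act_unchecked`, and `act_checked` are hypothetical names, and the re-check in `act_checked` is one possible mitigation idea, not a defense evaluated in this work.

```python
from dataclasses import dataclass


@dataclass
class Screen:
    """Toy stand-in for the device's foreground state (hypothetical)."""
    foreground: str  # name of the app currently on top


@dataclass
class PlannedAction:
    """A tap chosen by the agent, bound only to screen coordinates."""
    x: int
    y: int
    observed_foreground: str  # app that was on top at observation time


def observe_and_plan(screen: Screen) -> PlannedAction:
    # The agent captures the UI, then spends time reasoning about where to tap.
    return PlannedAction(x=540, y=1200, observed_foreground=screen.foreground)


def act_unchecked(screen: Screen, action: PlannedAction) -> str:
    # The tap is delivered to whatever app is in the foreground *now*.
    return f"tap({action.x},{action.y}) delivered to {screen.foreground}"


def act_checked(screen: Screen, action: PlannedAction) -> str:
    # Assumed mitigation: re-validate the foreground just before injecting input.
    if screen.foreground != action.observed_foreground:
        return "aborted: foreground changed since observation"
    return act_unchecked(screen, action)


def run(act) -> str:
    screen = Screen(foreground="benign_app")
    action = observe_and_plan(screen)
    # A foreground transition occurs during the agent's reasoning gap.
    screen.foreground = "target_app"
    return act(screen, action)


if __name__ == "__main__":
    print(run(act_unchecked))
    print(run(act_checked))
```

In this model the unchecked path delivers the tap to the app that took over the foreground, while the checked path aborts. A real re-check would still leave a small window between the check and the input injection, so it narrows the gap rather than closing it.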
