Zero-Permission Manipulation: Can We Trust Large Multimodal Model Powered GUI Agents?

GUI agents powered by large multimodal models are emerging as high-privilege operators on mobile platforms, entrusted with perceiving screen content and injecting inputs. However, their design rests on an implicit assumption of Visual Atomicity: that the UI state remains invariant between observation and action. We demonstrate that this assumption does not hold on Android, which creates a critical attack surface. We present Action Rebinding, a novel attack that allows a seemingly benign app with zero dangerous permissions to rebind an agent's execution. By exploiting the unavoidable observation-to-action gap in the agent's reasoning pipeline, the attacker triggers foreground transitions so that the agent's planned action is rebound to a target app. We weaponize the agent's task-recovery logic and Android's UI state preservation to orchestrate programmable, multi-step attack chains. Furthermore, we introduce an Intent Alignment Strategy (IAS) that manipulates the agent's reasoning process so that it rationalizes the UI states it encounters, allowing the attack to pass verification gates (e.g., confirmation dialogs) at which the agent would otherwise refuse to proceed. We evaluate Action Rebinding attacks on six widely used Android GUI agents across 15 tasks. Our results demonstrate a 100% success rate for atomic action rebinding and the ability to reliably orchestrate multi-step attack chains. With IAS, the success rate in bypassing verification gates rises from 0% to as high as 100%. Notably, the attacker application requires no sensitive permissions and contains no privileged API calls, and it achieves a 0% detection rate across malware scanners (e.g., VirusTotal). Our findings reveal a fundamental architectural flaw in current agent-OS integration and provide critical insights for the secure design of future agent systems. To access experimental logs and demonstration videos, please contact [email protected].
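The Visual Atomicity assumption is, in essence, a time-of-check-to-time-of-use (TOCTOU) condition. The toy simulation below is a minimal sketch of that gap and is not taken from the paper: the `Device` class, the package names, the button labels, and the `recheck` guard are all hypothetical, and no real Android or agent API is involved. It only illustrates why an action planned against one screenshot can land on a different foreground app when the state changes during model inference.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Toy device model (hypothetical): one foreground app at a time."""
    foreground: str
    buttons: dict  # package name -> label of the button at the planned tap point

    def screenshot(self):
        # What the agent perceives: foreground app and the button under the tap point.
        return (self.foreground, self.buttons[self.foreground])

    def tap(self):
        # The tap lands on whatever is in the foreground at injection time.
        return (self.foreground, self.buttons[self.foreground])


def switch_foreground(device):
    """Stands in for a foreground transition that occurs during model inference."""
    device.foreground = "com.example.target"


def run_agent(device, during_reasoning, recheck):
    observed = device.screenshot()      # time of check
    during_reasoning(device)            # reasoning latency: the UI may change here
    if recheck and device.screenshot() != observed:
        return ("aborted", observed, None)
    landed = device.tap()               # time of use
    return ("tapped", observed, landed)


def make_device():
    return Device(
        foreground="com.example.benign",
        buttons={
            "com.example.benign": "OK",
            "com.example.target": "Confirm payment",
        },
    )


unguarded = run_agent(make_device(), switch_foreground, recheck=False)
guarded = run_agent(make_device(), switch_foreground, recheck=True)
print(unguarded)
print(guarded)
```

In the unguarded run, the agent plans a tap on the benign app's button but the tap lands on the target app's button. The `recheck` guard is only an illustrative assumption about one possible mitigation: re-observing before acting narrows the window, but it does not close it unless the check and the input injection are made atomic.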
