GUI agents rely on screenshots to infer intent and operate across applications, but these screenshots often contain private messages, medical records, payment credentials, and workplace-specific workflows. Privacy decisions in this setting depend on task, recipient, application state, and user role, yet static PII detectors miss these boundaries and cloud-side VLM reasoning can upload the raw screen before deciding what should be protected. We present MaskClaw, an edge-side privacy arbitrator for GUI agents. MaskClaw extracts local visual evidence, retrieves user- and task-specific policy memory, and decides Allow, Mask, or Ask before raw screenshots leave a trusted user- or organization-controlled environment. In five designed skill-evolution scenarios, it turns corrections, cancellations, and edits into reusable privacy skills checked by a sandbox gate. We introduce P-GUI-Evo, a benchmark built from real UI patterns, reconstructed HTML screens, and sanitized labels. Experiments show that pattern matching, cloud reasoning, and routing alone tend to over-confirm, over-mask, or expose raw screenshots under the same protocol. The artifact is available at https://github.com/Theodora-Y/MaskClaw.
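As a rough illustration of the Allow/Mask/Ask arbitration described in the abstract, the sketch below looks up a decision in a local policy memory keyed by the detected content category plus the task, recipient, application state, and user role, and falls back to Ask when no stored policy applies. It is a minimal sketch under stated assumptions, not MaskClaw's implementation: every name here (`Decision`, `Context`, `Region`, `POLICY_MEMORY`, `arbitrate`, `learn_from_correction`), the category labels, and the exact-match lookup are hypothetical, and the sandbox gate that checks new skills is not modeled.

```python
from dataclasses import astuple, dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    MASK = "mask"
    ASK = "ask"


@dataclass(frozen=True)
class Context:
    # The four factors the abstract says privacy decisions depend on.
    task: str
    recipient: str
    app_state: str
    user_role: str


@dataclass(frozen=True)
class Region:
    # Hypothetical category emitted by a local (on-device) detector.
    label: str


def policy_key(region: Region, ctx: Context) -> tuple:
    """Build the lookup key: content category plus the full context."""
    return (region.label, *astuple(ctx))


# Hypothetical policy memory; a real system would retrieve entries, not hard-code them.
POLICY_MEMORY = {
    ("payment_credential", "checkout", "payment_app", "form_open", "owner"): Decision.ALLOW,
    ("payment_credential", "share_screenshot", "cloud_vlm", "form_open", "owner"): Decision.MASK,
}


def arbitrate(region: Region, ctx: Context, memory=POLICY_MEMORY) -> Decision:
    """Return the stored decision, or ASK when no policy covers this case."""
    return memory.get(policy_key(region, ctx), Decision.ASK)


def learn_from_correction(memory, region: Region, ctx: Context, corrected: Decision) -> dict:
    """Return a copy of the memory with the user's correction stored as a reusable rule."""
    updated = dict(memory)
    updated[policy_key(region, ctx)] = corrected
    return updated
```

In this sketch, an unseen combination yields `Decision.ASK`; once the user's answer is recorded through `learn_from_correction`, the same combination resolves without another prompt. The decision is made entirely from local inputs, which is the property the abstract emphasizes: nothing about the raw screenshot needs to leave the trusted environment before the lookup completes.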