SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

Web agents have emerged as an effective paradigm for automating interactions with complex web environments, yet remain vulnerable to prompt injection attacks that embed malicious instructions into webpage content to induce unintended actions. This threat is further amplified for screenshot-based web agents, which operate on rendered visual webpages rather than structured textual representations, making predominant text-centric defenses ineffective. Although multimodal detection methods have been explored, they often rely on large vision-language models (VLMs), incurring significant computational overhead. The bottleneck lies in the complexity of modern webpages: VLMs must comprehend the global semantics of an entire page, resulting in substantial inference time and GPU memory usage. This raises a critical question: can we detect prompt injection attacks from screenshots in a lightweight manner? In this paper, we observe that injected webpages exhibit distinct characteristics compared to benign ones from both visual and textual perspectives. Building on this insight, we propose SnapGuard, a lightweight yet accurate method that reformulates prompt injection detection as multimodal representation analysis over webpage screenshots. SnapGuard leverages two complementary signals: a visual stability indicator that identifies abnormally smooth gradient distributions induced by malicious content, and action-oriented textual signals recovered via contrast-polarity reversal. Extensive evaluations across eight attacks and two benign settings demonstrate that SnapGuard achieves an F1 score of 0.75, outperforming GPT-4o-prompt while being 8x faster (1.81s vs. 14.50s) and introducing no additional memory overhead.
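To make the two signals concrete, the following is a minimal sketch and not the SnapGuard implementation; the abstract does not specify how either signal is computed. It assumes that "abnormally smooth gradient distributions" can be approximated by a low spread of finite-difference gradient magnitudes, and that "contrast-polarity reversal" means inverting pixel intensities so that light-on-dark text becomes dark-on-light before text extraction. The function names, the `ACTION_WORDS` list, and the `spread_threshold` value are hypothetical, and the OCR step is omitted: recovered text is passed in as a plain string.

```python
from statistics import pstdev

# Hypothetical list of action-oriented words; the abstract gives no vocabulary.
ACTION_WORDS = ("click", "ignore", "submit", "enter", "download")


def gradient_magnitudes(gray):
    """Sum of absolute horizontal and vertical differences per pixel.

    `gray` is a 2D list of grayscale intensities (rows of equal length).
    """
    h, w = len(gray), len(gray[0])
    mags = []
    for y in range(h - 1):
        for x in range(w - 1):
            dx = gray[y][x + 1] - gray[y][x]
            dy = gray[y + 1][x] - gray[y][x]
            mags.append(abs(dx) + abs(dy))
    return mags


def gradient_spread(gray):
    """Assumed stability indicator: population std. dev. of gradient magnitudes."""
    return pstdev(gradient_magnitudes(gray))


def invert_polarity(gray, max_value=255):
    """Assumed contrast-polarity reversal: invert every pixel intensity."""
    return [[max_value - p for p in row] for row in gray]


def action_hits(text):
    """Return the action-oriented words found in already-extracted text."""
    words = (t.strip(".,!:;") for t in text.lower().split())
    return [w for w in words if w in ACTION_WORDS]


def flag(gray, recovered_text, spread_threshold=5.0):
    """Flag a region only when both (assumed) signals agree."""
    is_smooth = gradient_spread(gray) < spread_threshold
    return is_smooth and bool(action_hits(recovered_text))
```

In this sketch a region is flagged only when it is visually smooth and its recovered text contains an action word; how the actual method combines the visual and textual signals is not stated in the abstract.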
