A vision encoder compresses image pixels into semantic embeddings, implicitly acting as a privacy boundary by preserving semantic content while attenuating pixel-local detail required for exact text recovery. Encoder-free vision-language models (VLMs) remove this boundary by routing image patches directly into the language-model token stream, thereby exposing an architectural privacy attack surface: intermediate visual tokens become a pre-output side channel. Under a token-access adversary, decoders invert visual-token streams from two encoder-free VLMs, Gemma4 and Fuyu, recovering recognizable image structure and readable held-out access codes, whereas matched encoder-based controls localize target regions but recover no exact strings. Within-model ablations show that the operative factor is spatial sampling fidelity of the visual-token grid, especially character-direction sampling density, rather than token or value count. The leakage is not limited to exported tokens: Gemma4 layer-0 key-value cache tensors are directly invertible, placing the side channel within KV caches commonly persisted by production serving stacks for decoding efficiency. The attack survives clutter, realistic document degradation, and zero-shot transfer to public document images, and it resists value-level defenses such as additive noise and quantization. Effective mitigation must therefore reduce spatial sampling, making removal of the vision encoder a first-class privacy decision in VLM deployment.
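The distinction the abstract draws between value-level defenses and reductions in spatial sampling can be illustrated with a minimal sketch. It assumes the visual tokens form a hypothetical H x W grid of D-dimensional vectors; the function names, grid shape, noise scale, bit width, and pooling factors are all illustrative assumptions and are not taken from the paper's experiments. Additive noise and quantization perturb token values but leave the grid resolution intact, whereas pooling along the width (used here as a stand-in for the character direction) lowers the sampling density itself.

```python
import numpy as np


def add_noise(tokens: np.ndarray, sigma: float, rng: np.random.Generator) -> np.ndarray:
    """Value-level defense: add Gaussian noise; the grid shape is unchanged."""
    return tokens + rng.normal(0.0, sigma, size=tokens.shape)


def quantize(tokens: np.ndarray, bits: int) -> np.ndarray:
    """Value-level defense: uniform quantization to 2**bits levels; shape unchanged."""
    lo, hi = tokens.min(), tokens.max()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((tokens - lo) / scale) * scale + lo


def pool_grid(tokens: np.ndarray, factor_h: int, factor_w: int) -> np.ndarray:
    """Spatial defense: average-pool the token grid, reducing sampling density."""
    h, w, d = tokens.shape
    h2, w2 = h // factor_h, w // factor_w
    # Drop any ragged border so the grid divides evenly into pooling windows.
    cropped = tokens[: h2 * factor_h, : w2 * factor_w]
    return cropped.reshape(h2, factor_h, w2, factor_w, d).mean(axis=(1, 3))


rng = np.random.default_rng(0)
grid = rng.normal(size=(16, 64, 8))   # hypothetical H x W x D visual-token grid

noisy = add_noise(grid, sigma=0.1, rng=rng)
quant = quantize(grid, bits=4)
pooled = pool_grid(grid, factor_h=1, factor_w=4)  # pool only along the width

print(noisy.shape, quant.shape, pooled.shape)
```

The first two transforms return a grid with the same number of spatial positions as the input, while the third returns a grid with one quarter as many positions along the width. Whether a given pooling factor is sufficient against an inversion decoder is an empirical question this sketch does not address.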