Watermarking has emerged as a pivotal solution for content traceability and intellectual property protection in Large Vision-Language Models (LVLMs). However, vision-agnostic watermarks introduce visually irrelevant tokens and disrupt visual grounding by enforcing indiscriminate pseudo-random biases, while some semantic-aware methods incur prohibitive inference latency due to rejection sampling. In this paper, we propose the VIsual Semantic Adaptive Watermark (VISA-Mark), a novel framework that embeds detectable signals while strictly preserving visual fidelity. Our approach employs a lightweight, efficiently trained prefix-tuner to extract dynamic Visual-Evidence Weights, which quantify how strongly the visual input supports each candidate token. These weights guide an adaptive vocabulary-partitioning and logit-perturbation mechanism that concentrates watermark strength on visually supported tokens. By actively aligning the watermark with visual evidence, VISA-Mark preserves visual grounding rather than distorting it. Empirical results confirm that VISA-Mark outperforms conventional methods, improving visual consistency (CHAIR-I) by 7.8% while achieving superior semantic fidelity. The framework maintains highly competitive detection accuracy (96.88% AUC) and robust attack resilience (99.3%) without sacrificing inference efficiency, establishing a new standard for reliability-preserving multimodal watermarking.
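The abstract does not spell out the partitioning and perturbation rules, but a minimal sketch helps fix intuition. The sketch below assumes a standard Kirchenbauer-style pseudo-random green/red vocabulary split seeded by the previous token, and simply scales the green-list bias by the per-token Visual-Evidence Weight so that watermark strength concentrates on visually supported tokens. The function name `visa_style_bias`, the parameters `gamma` and `delta`, and the `evidence_weights` input (a stand-in for the prefix-tuner's output) are illustrative assumptions, not the paper's actual interface.

```python
import torch

def visa_style_bias(
    logits: torch.Tensor,            # (vocab_size,) next-token logits from the LVLM
    evidence_weights: torch.Tensor,  # (vocab_size,) assumed Visual-Evidence Weights in [0, 1]
    prev_token_id: int,              # previous token id, used to seed the partition
    gamma: float = 0.5,              # assumed green-list fraction
    delta: float = 2.0,              # assumed maximum watermark bias
) -> torch.Tensor:
    """Illustrative evidence-weighted green-list watermarking (not VISA-Mark itself).

    A detector that knows the seeding rule can replay the partition and
    count green-list hits; the only departure from a vanilla green-list
    scheme here is that the bias is modulated per token by the
    Visual-Evidence Weight, so visually unsupported tokens receive
    little or no watermark pressure.
    """
    vocab_size = logits.shape[-1]
    # Seed a CPU generator with the previous token so the split is reproducible.
    gen = torch.Generator().manual_seed(int(prev_token_id))
    perm = torch.randperm(vocab_size, generator=gen)
    # Mark a gamma-fraction of the vocabulary as the "green list".
    green = torch.zeros(vocab_size, dtype=torch.bool)
    green[perm[: int(gamma * vocab_size)]] = True
    # Bias green tokens, scaled by how strongly the image supports each token.
    bias = delta * evidence_weights * green.to(device=logits.device, dtype=logits.dtype)
    return logits + bias

# Toy usage with a hypothetical 8-token vocabulary:
logits = torch.randn(8)
weights = torch.rand(8)  # stand-in for prefix-tuner outputs
biased = visa_style_bias(logits, weights, prev_token_id=3)
```

One design point the sketch makes concrete: because the bias is multiplied by the evidence weight rather than applied uniformly, tokens with near-zero visual support are left essentially unperturbed, which is how this style of scheme avoids the indiscriminate pseudo-random biases the abstract attributes to vision-agnostic watermarks.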