Following its success in large language models, autoregressive learning has become a dominant approach for text-to-image generation, offering high efficiency and visual quality. However, invisible watermarking for visual autoregressive (VAR) models remains underexplored, despite its importance in preventing misuse. Existing watermarking methods, designed for diffusion models, often struggle to adapt to the sequential nature of VAR models. To bridge this gap, we propose Safe-VAR, the first watermarking framework specifically designed for autoregressive text-to-image generation. Our study reveals that the timing of watermark injection significantly impacts generation quality, and that watermarks of different complexity have different optimal injection times. Motivated by this observation, we propose an Adaptive Scale Interaction Module, which dynamically determines the optimal watermark embedding strategy based on the watermark information and the visual characteristics of the generated image. This ensures watermark robustness while minimizing the impact on image quality. Furthermore, we introduce a Cross-Scale Fusion mechanism, which integrates mixture-of-heads and mixture-of-experts strategies to fuse multi-resolution features effectively and to handle complex interactions between image content and watermark patterns. Experimental results demonstrate that Safe-VAR achieves state-of-the-art performance, significantly surpassing existing methods in image quality, watermark fidelity, and robustness against perturbations. Moreover, our method generalizes well to an out-of-domain watermark dataset of QR codes.