Statistical watermarking techniques are well-established for sequentially decoded language models (LMs). However, these techniques cannot be directly applied to order-agnostic LMs, as the tokens in order-agnostic LMs are not generated sequentially. In this work, we introduce Pattern-mark, a pattern-based watermarking framework specifically designed for order-agnostic LMs. We develop a Markov-chain-based watermark generator that produces watermark key sequences with high-frequency key patterns. Correspondingly, we propose a statistical pattern-based detection algorithm that recovers the key sequence during detection and conducts statistical tests based on the count of high-frequency patterns. Our extensive evaluations on order-agnostic LMs, such as ProteinMPNN and CMLM, demonstrate Pattern-mark's enhanced detection efficiency, generation quality, and robustness, positioning it as a superior watermarking technique for order-agnostic LMs.
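To make the generate-then-detect idea concrete, the following is a minimal toy sketch in Python, not the authors' implementation. Everything in it is an assumption made for illustration: a two-key alphabet, a parity-based vocabulary split (`token_to_key`), a `flip_prob` parameter controlling the Markov chain, hard restriction of each position to its key's vocabulary half (a real system would more plausibly reweight the model's token probabilities), and a simple z-test in place of whatever test statistic the paper uses. The key at each position depends only on the position, so tokens can be filled in any order, which is the property order-agnostic LMs need.

```python
import math
import random


def generate_key_sequence(n, flip_prob=0.9, rng=None):
    """Sample a binary key sequence from a two-state Markov chain.

    With flip_prob > 0.5 the chain prefers to alternate, so the bigram
    patterns (0, 1) and (1, 0) become the high-frequency patterns.
    (Hypothetical parameterization chosen for this sketch.)
    """
    rng = rng or random.Random()
    keys = [rng.randrange(2)]
    for _ in range(n - 1):
        prev = keys[-1]
        keys.append(1 - prev if rng.random() < flip_prob else prev)
    return keys


def token_to_key(token_id):
    """Assumed vocabulary partition: even ids map to key 0, odd ids to key 1."""
    return token_id % 2


def sample_watermarked_tokens(keys, vocab_size, rng=None):
    """Fill positions in a random order, as an order-agnostic decoder might.

    Each position draws a uniform token from the vocabulary half that matches
    its key. vocab_size is assumed to be even.
    """
    rng = rng or random.Random()
    tokens = [None] * len(keys)
    order = list(range(len(keys)))
    rng.shuffle(order)  # generation order is arbitrary; keys are tied to positions
    for pos in order:
        tokens[pos] = rng.randrange(vocab_size // 2) * 2 + keys[pos]
    return tokens


def pattern_z_score(tokens):
    """Recover the key sequence and test the count of alternating bigrams.

    Null hypothesis (assumed): keys are i.i.d. fair bits, so each adjacent
    pair alternates with probability 1/2, independently across pairs.
    """
    keys = [token_to_key(t) for t in tokens]
    m = len(keys) - 1  # number of adjacent pairs
    hits = sum(1 for a, b in zip(keys, keys[1:]) if a != b)
    return (hits - 0.5 * m) / math.sqrt(0.25 * m)


if __name__ == "__main__":
    rng = random.Random(0)
    keys = generate_key_sequence(200, rng=rng)
    marked = sample_watermarked_tokens(keys, vocab_size=20, rng=rng)
    plain = [rng.randrange(20) for _ in range(200)]
    print("watermarked z:", pattern_z_score(marked))
    print("unwatermarked z:", pattern_z_score(plain))
```

In this sketch a large positive z-score indicates that alternating patterns occur far more often than chance, which is the evidence for the watermark. Detection needs only the token sequence and the vocabulary partition, not the order in which positions were generated.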