This study aims to proactively tackle the misuse of large language models, going beyond the identification of machine-generated text. Existing methods focus on detection, but some malicious misuse requires tracing the adversarial user in order to counteract it. To address this, we propose "Multi-bit Watermark through Color-listing" (COLOR), which embeds traceable multi-bit information during language model generation. Building on the benefits of zero-bit watermarking (Kirchenbauer et al., 2023a), COLOR supports extraction without model access and on-the-fly embedding, maintains text quality, and still allows zero-bit detection. Preliminary experiments demonstrate successful embedding of 32-bit messages with 91.9% accuracy in moderate-length texts ($\sim$500 tokens). This work advances strategies for effectively countering language model misuse.
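To make the idea of embedding and extracting a multi-bit message concrete, the following is a minimal toy sketch, not the implementation evaluated in this work. It assumes details the abstract does not state: two colors per position (one bit each), a message position and a token color both derived by hashing the previous token, and uniform random noise standing in for the language model's logits. All names (`position_of`, `color_of`, `embed`, `extract`, `VOCAB_SIZE`, `delta`) are hypothetical.

```python
import hashlib
import random

VOCAB_SIZE = 1000  # hypothetical toy vocabulary size
NUM_COLORS = 2     # assumption: two colors, so each position carries one bit


def _hash(*parts: int) -> int:
    """Deterministic hash of integers (a real scheme would mix in a secret key)."""
    data = ",".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def position_of(prev_token: int, msg_len: int) -> int:
    """Message position carried by the next token, derived from the context."""
    return _hash(0, prev_token) % msg_len


def color_of(prev_token: int, token: int) -> int:
    """Pseudo-random color of `token` given the previous token."""
    return _hash(1, prev_token, token) % NUM_COLORS


def embed(message, num_tokens, delta=2.0, seed=0):
    """Generate tokens from a stand-in model, biased toward the message colors."""
    rng = random.Random(seed)
    tokens = [0]  # arbitrary start token
    for _ in range(num_tokens):
        prev = tokens[-1]
        bit = message[position_of(prev, len(message))]
        # Stand-in for LM logits: uniform noise in [0, 1). A real model goes here.
        logits = [rng.random() for _ in range(VOCAB_SIZE)]
        for tok in range(VOCAB_SIZE):
            if color_of(prev, tok) == bit:
                logits[tok] += delta  # favor the color list matching the bit
        tokens.append(max(range(VOCAB_SIZE), key=logits.__getitem__))
    return tokens


def extract(tokens, msg_len):
    """Recover the message by per-position majority vote; needs no model access."""
    votes = [[0] * NUM_COLORS for _ in range(msg_len)]
    for prev, tok in zip(tokens, tokens[1:]):
        votes[position_of(prev, msg_len)][color_of(prev, tok)] += 1
    return [max(range(NUM_COLORS), key=v.__getitem__) for v in votes]


message = [1, 0, 1, 1, 0, 0, 1, 0]
tokens = embed(message, num_tokens=200)
recovered = extract(tokens, len(message))
```

Because the stand-in logits lie in [0, 1) and `delta` is 2.0, every generated token here falls in the favored color list, so `recovered` should match `message` whenever each position receives at least one token. With a real model and a smaller bias, extraction becomes statistical, which is consistent with the sub-100% accuracy reported above.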