Recent advances in large language models (LLMs) have made their text outputs nearly indistinguishable from human-written text. Watermarking algorithms offer a way to differentiate LLM-generated from human-generated text by embedding detectable signatures in LLM output. However, current watermarking schemes lack robustness against known attacks. They are also impractical: an LLM can produce tens of thousands of text outputs per day, and detection requires the watermarking algorithm to memorize every output it generates. Addressing these limitations, we propose the concept of a "topic-based watermarking algorithm" for LLMs. The proposed algorithm determines how to generate tokens for the watermarked LLM output based on topics extracted from the input prompt or from the output of a non-watermarked LLM. Inspired by prior work, we use a pair of lists, generated from the specified extracted topic(s), that specify which tokens to include or exclude while generating the watermarked LLM output. Using the proposed watermarking algorithm, we demonstrate the practicality of a corresponding watermark detection algorithm. Furthermore, we discuss a wide range of attacks that can emerge against LLM watermarking algorithms, and we show how the proposed scheme makes it feasible to model a potential attacker in terms of its benefit-versus-loss trade-off.
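The pair-of-lists idea above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual construction: the function names (`topic_token_lists`, `detect`), the deterministic topic-seeded partition of the vocabulary, and the fixed detection threshold are all assumptions introduced here for illustration.

```python
import hashlib
import random

def topic_token_lists(topic: str, vocab: list, green_fraction: float = 0.5):
    """Hypothetical sketch: derive include/exclude token lists from a topic.

    The topic string seeds a PRNG so the same extracted topic always
    yields the same partition of the vocabulary; this mirrors the idea
    of a pair of lists generated from the extracted topic(s), not the
    exact construction in the paper.
    """
    seed = int.from_bytes(hashlib.sha256(topic.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    shuffled = list(vocab)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * green_fraction)
    green = set(shuffled[:cut])  # tokens favored while generating output
    red = set(shuffled[cut:])    # tokens avoided while generating output
    return green, red

def detect(tokens: list, green: set, threshold: float = 0.7) -> bool:
    """Flag text as watermarked if the rate of 'green' tokens is high.

    The 0.7 threshold is an illustrative placeholder; a real detector
    would be calibrated against the expected green fraction.
    """
    hits = sum(t in green for t in tokens)
    return hits / max(len(tokens), 1) >= threshold
```

Because the lists are recomputed from the topic at detection time, this style of detector needs no memory of past outputs, which is the practicality advantage the abstract refers to.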