The capabilities of large language models have grown significantly in recent years, and so too have concerns about their misuse. In this context, the ability to distinguish machine-generated text from human-authored content becomes important. Prior work has proposed numerous schemes to watermark text, which would benefit from a systematic evaluation framework. This work focuses on text watermarking techniques (as opposed to image watermarks) and proposes a comprehensive benchmark for them under different tasks as well as practical attacks. We focus on three main metrics: quality, size (e.g., the number of tokens needed to detect a watermark), and tamper-resistance. Current watermarking techniques are good enough to be deployed: the scheme of Kirchenbauer et al. can watermark Llama2-7B-chat with no perceivable loss in quality, can be detected in under 100 tokens, and has good tamper-resistance to simple attacks, regardless of temperature. We argue that watermark indistinguishability is too strong a requirement: schemes that slightly modify logit distributions outperform their indistinguishable counterparts with no noticeable loss in generation quality. We publicly release our benchmark.