The capabilities of large language models have grown significantly in recent years, and so too have concerns about their misuse. In this context, the ability to distinguish machine-generated text from human-authored content becomes important. Prior work has proposed numerous text watermarking schemes, which would benefit from a systematic evaluation framework. This work focuses on text watermarking techniques, as opposed to image watermarks, and proposes MARKMYWORDS, a comprehensive benchmark for them under different tasks as well as practical attacks. We focus on three main metrics: quality, size (e.g., the number of tokens needed to detect a watermark), and tamper-resistance. We find that current watermarking techniques are good enough to be deployed: the scheme of Kirchenbauer et al. [1] can watermark Llama2-7B-chat with no perceivable loss in quality, the watermark can be detected with fewer than 100 tokens, and the scheme offers good tamper-resistance to simple attacks. We argue that watermark indistinguishability, a criterion emphasized in some prior works, is too strong a requirement: schemes that slightly modify logit distributions outperform their indistinguishable counterparts with no noticeable loss in generation quality. We publicly release our benchmark (https://github.com/wagner-group/MarkMyWords).
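To make concrete what a scheme that slightly modifies logit distributions looks like, and why detection depends on the number of tokens, the following is a minimal sketch in the style of the green-list scheme of Kirchenbauer et al. [1]. It is not the benchmark's implementation. The key, the values of `GAMMA` and `DELTA`, the per-token hashing rule, and the random-logit stand-in for a language model are all assumptions made for illustration.

```python
import hashlib
import math
import random

GAMMA = 0.5         # assumed probability that a token is "green"
DELTA = 2.0         # assumed bias added to the logits of green tokens
KEY = b"demo-key"   # hypothetical secret key shared by generator and detector


def is_green(prev_token: int, token: int) -> bool:
    """Keyed pseudorandom green/red assignment that depends on the previous token."""
    data = KEY + prev_token.to_bytes(4, "big") + token.to_bytes(4, "big")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA


def bias_logits(prev_token: int, logits: list) -> list:
    """Add DELTA to every green token's logit; red tokens are left unchanged."""
    return [l + DELTA if is_green(prev_token, t) else l for t, l in enumerate(logits)]


def sample(logits: list, rng: random.Random) -> int:
    """Sample a token index from softmax(logits)."""
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(n_tokens: int, vocab_size: int, watermark: bool, seed: int = 0) -> list:
    """Generate tokens from a toy model (random logits), optionally watermarked."""
    rng = random.Random(seed)
    tokens = [0]  # arbitrary start token
    for _ in range(n_tokens):
        logits = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
        if watermark:
            logits = bias_logits(tokens[-1], logits)
        tokens.append(sample(logits, rng))
    return tokens


def z_score(tokens: list) -> float:
    """One-proportion z-test on the number of green tokens among the scored tokens."""
    t = len(tokens) - 1  # the start token is not scored
    green = sum(is_green(p, c) for p, c in zip(tokens, tokens[1:]))
    return (green - GAMMA * t) / math.sqrt(t * GAMMA * (1 - GAMMA))


if __name__ == "__main__":
    for n in (25, 50, 100, 200):
        wm = generate(n, vocab_size=50, watermark=True, seed=1)
        plain = generate(n, vocab_size=50, watermark=False, seed=1)
        print(n, round(z_score(wm), 2), round(z_score(plain), 2))
```

Under these assumptions, the z-score of watermarked text grows roughly with the square root of the number of scored tokens, while unwatermarked text stays near zero. A detector would flag text whose z-score exceeds a chosen threshold, and the size metric above corresponds to how many tokens are needed to cross it.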