To mitigate the potential misuse of large language models (LLMs), recent research has developed watermarking algorithms, which constrain the generation process so that it leaves an invisible trace for watermark detection. Because the task has two stages, most studies evaluate generation and detection separately, which makes it difficult to obtain unbiased, thorough, and applicable evaluations. In this paper, we introduce WaterBench, the first comprehensive benchmark for LLM watermarks, built around three key design factors: (1) For the \textbf{benchmarking procedure}, to ensure an apples-to-apples comparison, we first adjust each watermarking method's hyper-parameters to reach the same watermarking strength, then jointly evaluate their generation and detection performance. (2) For \textbf{task selection}, we diversify the input and output lengths to form a five-category taxonomy covering $9$ tasks. (3) For the \textbf{evaluation metric}, we adopt GPT4-Judge to automatically evaluate the decline in instruction-following ability after watermarking. We evaluate $4$ open-source watermarks on $2$ LLMs under $2$ watermarking strengths and observe that current methods commonly struggle to maintain generation quality. The code and data are available at \url{https://github.com/THU-KEG/WaterBench}.
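The strength-alignment step in the benchmarking procedure (1) can be illustrated with a minimal sketch. It is not the WaterBench implementation: it assumes that watermarking strength can be summarized as a single number in $[0, 1]$ (for example, a detection rate on watermarked outputs) and that this number does not decrease as the tuned hyper-parameter grows. The names `calibrate_strength`, `toy_strength_a`, and `toy_strength_b` are hypothetical, and the two toy curves are stand-ins for actually generating and detecting text.

```python
import math
from typing import Callable


def calibrate_strength(
    measure_strength: Callable[[float], float],
    target: float,
    lo: float,
    hi: float,
    tol: float = 1e-3,
    max_iter: int = 50,
) -> float:
    """Bisection search for the hyper-parameter value whose measured
    strength is within `tol` of `target`.

    Assumes `measure_strength` is non-decreasing on [lo, hi].
    """
    if not (measure_strength(lo) <= target <= measure_strength(hi)):
        raise ValueError("target strength is not bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        strength = measure_strength(mid)
        if abs(strength - target) <= tol:
            return mid
        if strength < target:
            lo = mid  # watermark too weak: increase the hyper-parameter
        else:
            hi = mid  # watermark too strong: decrease the hyper-parameter
    return (lo + hi) / 2


# Hypothetical stand-ins for "generate with the watermark, then run the
# detector and report strength". Each method responds differently to its
# own hyper-parameter, so the same value does not mean the same strength.
def toy_strength_a(delta: float) -> float:
    return 1.0 / (1.0 + math.exp(-(delta - 2.0)))


def toy_strength_b(delta: float) -> float:
    return 1.0 / (1.0 + math.exp(-2.0 * (delta - 4.0)))


target = 0.95
delta_a = calibrate_strength(toy_strength_a, target, lo=0.0, hi=10.0)
delta_b = calibrate_strength(toy_strength_b, target, lo=0.0, hi=10.0)
print(f"method A: delta={delta_a:.3f}, strength={toy_strength_a(delta_a):.3f}")
print(f"method B: delta={delta_b:.3f}, strength={toy_strength_b(delta_b):.3f}")
```

Once every method has been tuned to the same target, generation quality can be compared at a matched strength rather than at each method's default setting, which is the point of the apples-to-apples comparison.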