To mitigate the potential misuse of large language models (LLMs), recent research has developed watermarking algorithms, which constrain the generation process to leave an invisible trace for watermark detection. Due to the two-stage nature of the task, most studies evaluate generation and detection separately, which makes unbiased, thorough, and practical evaluation difficult. In this paper, we introduce WaterBench, the first comprehensive benchmark for LLM watermarks, in which we design three crucial factors: (1) For the benchmarking procedure, to ensure an apples-to-apples comparison, we first adjust each watermarking method's hyper-parameters to reach the same watermarking strength, then jointly evaluate their generation and detection performance. (2) For task selection, we diversify the input and output lengths to form a five-category taxonomy covering $9$ tasks. (3) For evaluation metrics, we adopt GPT4-Judge to automatically evaluate the decline in instruction-following ability after watermarking. We evaluate $4$ open-source watermarks on $2$ LLMs under $2$ watermarking strengths and observe that current methods commonly struggle to maintain generation quality. The code and data are available at https://github.com/THU-KEG/WaterBench.