Various watermarking methods ("watermarkers") have been proposed to identify LLM-generated text; yet, because no unified evaluation platform exists, many critical questions remain under-explored: i) What are the strengths and limitations of various watermarkers, especially their robustness to attacks? ii) How do various design choices affect their robustness? iii) How can watermarkers be operated optimally in adversarial environments? To fill this gap, we systematize existing LLM watermarkers and watermark removal attacks, mapping out their design spaces. We then develop WaterPark, a unified platform that integrates 10 state-of-the-art watermarkers and 12 representative attacks. Leveraging WaterPark, we conduct a comprehensive assessment of existing watermarkers, unveiling the impact of various design choices on their attack robustness. We further explore best practices for operating watermarkers in adversarial environments. We believe our study sheds light on current LLM watermarking techniques, while WaterPark serves as a valuable testbed to facilitate future research.