Watermarking has emerged as a crucial technique for detecting and attributing content generated by large language models. While recent work has turned to watermark ensembles to enhance robustness, prevailing methods typically prioritize maximizing watermark strength at every individual layer. In this work, we identify a critical limitation of this "stronger-is-better" approach: strong watermarks significantly reduce the entropy of the token distribution, which paradoxically weakens watermarking in subsequent layers. We show, both theoretically and empirically, that detectability is bounded by entropy and that watermark ensembles induce a monotonic decrease in both entropy and the expected green-list ratio across layers. To address this inherent trade-off, we propose a general framework that employs weaker single-layer watermarks to preserve the entropy required for effective multi-layer ensembling. Empirical evaluations demonstrate that this counter-intuitive strategy mitigates signal decay and consistently outperforms strong baselines in both detectability and robustness.
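As a minimal illustration of the entropy effect described above (a toy sketch, not the paper's implementation), a KGW-style green-list watermark adds a bias delta to the logits of "green" tokens before sampling. Larger delta concentrates probability mass on the green list, which raises the detectable green-token mass but lowers the entropy of the resulting distribution; the vocabulary size, logit values, and delta choices below are hypothetical:

```python
import math
import random

def softmax(logits):
    """Convert logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def apply_green_bias(logits, green_mask, delta):
    """KGW-style watermark: add a bias delta to green-list token logits."""
    return [l + delta if g else l for l, g in zip(logits, green_mask)]

# Toy next-token distribution over a 50-token vocabulary (illustrative only).
random.seed(0)
vocab_size = 50
logits = [random.gauss(0.0, 1.0) for _ in range(vocab_size)]
green = [random.random() < 0.5 for _ in range(vocab_size)]  # ~50% green list

entropies, green_mass = {}, {}
for delta in (0.0, 2.0, 5.0):
    probs = softmax(apply_green_bias(logits, green, delta))
    entropies[delta] = entropy(probs)
    green_mass[delta] = sum(p for p, g in zip(probs, green) if g)
    print(f"delta={delta}: entropy={entropies[delta]:.3f}, "
          f"green mass={green_mass[delta]:.3f}")
```

In this toy setting, increasing delta drives the green-token mass toward 1 while shrinking the entropy available to any subsequent watermark layer, which is the trade-off the abstract's weaker-watermark framework is designed to manage.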