Watermarking is a principled approach for tracing the provenance of large language model (LLM) outputs, but its deployment in practice is hindered by inference inefficiency. Speculative sampling accelerates inference, with efficiency improving as the acceptance rate between draft and target models increases. Yet recent work reveals a fundamental trade-off: higher watermark strength lowers the acceptance rate, so strong watermarking and efficient speculative sampling cannot be achieved simultaneously. We revisit this trade-off and show it is not absolute. We introduce a quantitative measure of watermark strength that governs statistical detectability and is maximized when tokens are deterministic functions of pseudorandom numbers. Using this measure, we fully characterize the trade-off as a constrained optimization problem and derive explicit Pareto curves for two existing watermarking schemes. Finally, we introduce a principled mechanism that injects pseudorandomness into draft-token acceptance, ensuring maximal watermark strength while maintaining speculative sampling efficiency. Experiments show that this approach improves detectability without sacrificing efficiency. Our findings uncover a principle that unites speculative sampling and watermarking, paving the way for their efficient and practical deployment.
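To make the acceptance step concrete, the following is a minimal sketch of one standard speculative sampling step in which the acceptance coin is supplied by the caller, so it can be drawn from a keyed pseudorandom source instead of fresh randomness. It is an illustration under stated assumptions, not the mechanism proposed in this work: the helper `keyed_uniform`, the SHA-256 construction, the `tag` labels, and the toy distributions `p` (target) and `q` (draft) are all hypothetical.

```python
import hashlib


def keyed_uniform(key: bytes, context: tuple, tag: str) -> float:
    """Hypothetical helper: map (key, context, tag) to a reproducible number in [0, 1)."""
    digest = hashlib.sha256(key + repr((context, tag)).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def speculative_step(p, q, draft_token, u_accept, u_resample):
    """One standard speculative sampling step over a finite vocabulary.

    p, q        : target and draft distributions (lists of probabilities)
    draft_token : token index proposed by the draft model (assumes q[draft_token] > 0)
    u_accept    : number in [0, 1) used as the acceptance coin
    u_resample  : number in [0, 1) used to resample after a rejection
    Returns (token, accepted).
    """
    # Accept the draft token with probability min(1, p(x) / q(x)).
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if u_accept < accept_prob:
        return draft_token, True

    # On rejection, resample from the residual distribution max(p - q, 0),
    # using inverse-CDF sampling driven by u_resample.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    threshold = u_resample * sum(residual)
    cumulative = 0.0
    for token, mass in enumerate(residual):
        cumulative += mass
        if threshold < cumulative:
            return token, False
    # Floating-point fallback: return the token with the largest residual mass.
    return max(range(len(residual)), key=residual.__getitem__), False


if __name__ == "__main__":
    key = b"demo-key"          # assumed secret key shared with a detector
    context = (17, 42, 5)      # assumed preceding token ids
    p = [0.5, 0.5]             # toy target distribution
    q = [0.9, 0.1]             # toy draft distribution
    u_a = keyed_uniform(key, context, "accept")
    u_r = keyed_uniform(key, context, "resample")
    print(speculative_step(p, q, draft_token=0, u_accept=u_a, u_resample=u_r))
```

Because both numbers are deterministic functions of the key and the context, anyone holding the key can recompute them for a given text. When the numbers are uniform on [0, 1), this step reproduces the target distribution `p`, which is the property that lets speculative sampling speed up inference without changing the output distribution.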