Recent advances in large language models (LLMs) have raised growing concern about their potential misuse. One approach to mitigating this risk is to incorporate watermarking techniques into LLMs, allowing model outputs to be tracked and attributed. This study examines a crucial aspect of watermarking: how much watermarks affect the quality of model-generated outputs. Previous studies have suggested a trade-off between watermark strength and output quality. However, our research demonstrates that, with appropriate implementation, it is possible to integrate watermarks without affecting the output probability distribution. We refer to this type of watermark as an unbiased watermark. This has significant implications for the use of LLMs, as it becomes impossible for users to discern whether a service provider has incorporated watermarks. Furthermore, the presence of watermarks does not compromise the performance of the model in downstream tasks, ensuring that the overall utility of the language model is preserved. Our findings contribute to the ongoing discussion around responsible AI development, suggesting that unbiased watermarks can serve as an effective means of tracking and attributing model outputs without sacrificing output quality.
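To make the notion of an unbiased watermark concrete, the following is a minimal sketch and not necessarily the construction used in this work. It assumes a toy next-token distribution and replaces the sampler's fresh randomness with a pseudo-random number derived from a secret key and the context. Each individual choice is then reproducible by anyone holding the key, while the distribution averaged over contexts (or keys) should still match the original one, provided the hash behaves like a uniform random source. All names here (`keyed_uniform`, `watermarked_sample`, `empirical_distribution`) are hypothetical, and detection is not shown.

```python
import hashlib
from collections import Counter


def keyed_uniform(secret_key: bytes, context: tuple) -> float:
    """Map (key, context) to a pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(secret_key + repr(context).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def watermarked_sample(probs: dict, secret_key: bytes, context: tuple) -> str:
    """Inverse-CDF sampling driven by the keyed number instead of fresh randomness."""
    u = keyed_uniform(secret_key, context)
    cumulative = 0.0
    for token, p in probs.items():
        cumulative += p
        if u < cumulative:
            return token
    return token  # guard against floating-point round-off in the cumulative sum


def empirical_distribution(probs: dict, secret_key: bytes, n_contexts: int = 20000) -> dict:
    """Frequency of each token over many distinct (synthetic) contexts."""
    counts = Counter(
        watermarked_sample(probs, secret_key, ("ctx", i)) for i in range(n_contexts)
    )
    return {token: counts[token] / n_contexts for token in probs}


if __name__ == "__main__":
    toy_probs = {"a": 0.5, "b": 0.3, "c": 0.2}  # assumed toy model output
    print(empirical_distribution(toy_probs, b"secret"))
```

In this sketch the empirical frequencies should stay close to the toy probabilities, which is the sense in which the watermark leaves the output distribution unchanged, while the keyed choice per context is what a detector could later check.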