We consider the emerging problem of identifying the presence and use of watermarking schemes in widely used, publicly hosted, closed-source large language models (LLMs). We introduce a suite of baseline algorithms for identifying watermarks in LLMs that rely on analyzing the distributions of output tokens and logits generated by watermarked and unmarked LLMs. Notably, watermarked LLMs tend to produce distributions that diverge qualitatively and identifiably from those of standard models. Furthermore, we investigate the identifiability of watermarks at varying strengths and consider the tradeoffs of each of our identification mechanisms with respect to the watermarking scenario. Along the way, we formalize the specific problem of identifying watermarks in LLMs, as well as LLM watermarks and watermark detection in general, providing a framework and foundations for studying them.