Watermarking has recently emerged as an effective strategy for detecting text generated by large language models (LLMs). The strength of a watermark typically depends strongly on the entropy afforded by the language model and the set of input prompts. However, entropy can be quite limited in practice, especially for post-trained models, for example those tuned via instruction tuning or reinforcement learning from human feedback (RLHF), which makes detection based on watermarking alone challenging. In this work, we investigate whether detection can be improved by combining watermark detectors with non-watermark ones. We explore a number of hybrid schemes that combine the two and observe performance gains over either class of detector alone under a wide range of experimental conditions.