PromptCOS: Towards Content-only System Prompt Copyright Auditing for LLMs

System prompts are critical for shaping the behavior and output quality of large language model (LLM)-based applications, driving substantial investment in optimizing high-quality prompts beyond traditional handcrafted designs. However, as system prompts become valuable intellectual property, they are increasingly vulnerable to prompt theft and unauthorized use, highlighting the urgent need for effective copyright auditing, particularly through watermarking. Existing methods rely on verifying subtle logit distribution shifts triggered by a query. We observe that this logit-dependent verification framework is impractical in real-world content-only settings, primarily because (1) random sampling makes content-level generation unstable for verification, and (2) the stronger instructions needed to produce content-level signals compromise prompt fidelity. To overcome these challenges, we propose PromptCOS, the first content-only system prompt copyright auditing method based on content-level output similarity. PromptCOS achieves watermark stability by designing a cyclic output signal as the conditional instruction's target. It preserves prompt fidelity by injecting a small set of auxiliary tokens to encode the watermark, leaving the main prompt untouched. Furthermore, to ensure robustness against malicious removal, we optimize cover tokens, i.e., critical tokens in the original prompt, so that removing the auxiliary tokens causes severe performance degradation. Experimental results show that PromptCOS achieves high effectiveness (99.3% average watermark similarity), strong distinctiveness (60.8% higher than the best baseline), high fidelity (accuracy degradation no greater than 0.6%), robustness (resilience against four potential attack categories), and high computational efficiency (up to 98.1% cost savings).
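To make the idea of content-level verification against a cyclic signal more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the watermark target is a short token pattern repeated cyclically, that outputs are compared by whitespace tokens, and that similarity is the best position-wise match rate over all phase offsets. The names `cyclic_similarity` and `is_watermarked`, the example pattern, and the `threshold` value are all hypothetical; the abstract does not specify the actual similarity metric or decision rule.

```python
from typing import List, Sequence


def cyclic_similarity(output_tokens: Sequence[str], cycle: Sequence[str]) -> float:
    """Fraction of output positions that match the cyclic pattern.

    Assumption: the score is taken at the best phase offset, so an output
    that starts partway through the cycle still scores highly.
    """
    if not output_tokens or not cycle:
        return 0.0
    n = len(cycle)
    best = 0
    for offset in range(n):
        matches = sum(
            1
            for i, tok in enumerate(output_tokens)
            if tok == cycle[(i + offset) % n]
        )
        best = max(best, matches)
    return best / len(output_tokens)


def is_watermarked(outputs: List[str], cycle: Sequence[str], threshold: float = 0.8) -> bool:
    """Decide from text outputs alone (no logits) whether the signal is present.

    Assumption: the verdict is the mean similarity over several sampled
    responses to the verification query, compared against a fixed threshold.
    """
    if not outputs:
        return False
    scores = [cyclic_similarity(text.split(), cycle) for text in outputs]
    return sum(scores) / len(scores) >= threshold


if __name__ == "__main__":
    pattern = ["a", "b", "c"]  # hypothetical cyclic signal
    suspect = ["b c a b c a", "a b c a b"]  # shifted and truncated samples
    clean = ["the weather is nice today"]
    print(is_watermarked(suspect, pattern))
    print(is_watermarked(clean, pattern))
```

Under these assumptions, a periodic target tolerates outputs that are truncated or begin at a different point in the cycle, which is one plausible reading of why a cyclic signal would help stability under random sampling.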
