The proliferation of open-source code and of large language models (LLMs) for code generation has amplified the risks of unauthorized reuse and intellectual property infringement. Source code watermarking offers a potential solution, yet existing methods typically encode watermarks through identifiers, local code patterns, or limited handcrafted edits, leaving them vulnerable to renaming, refactoring, and adaptive watermark removal. These limitations make it difficult to achieve robustness, capacity, generalization, and deployment efficiency at the same time. We propose CLASP, a Code LLM-Assisted Semantic-Preserving watermarking framework that enables training-free, plug-and-play watermarking for source code. CLASP embeds watermark bits within a fixed space of semantics-preserving transformations, which automates watermark insertion at higher capacity, keeps the scheme reusable across programming languages, and reduces its dependence on brittle lexical features. To recover the watermark, CLASP uses reference-code retrieval and differential comparison to identify transformation traces, avoiding task-specific model training while improving robustness to structural edits and adaptive attacks. Experiments across multiple programming languages show that CLASP consistently outperforms existing baselines in watermark extraction accuracy and robustness under both random removal and adaptive de-watermarking attacks, while maintaining code quality.
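To make the embedding and recovery idea concrete, the following is a minimal sketch rather than the CLASP implementation. It assumes a transformation space of only two hand-written rewrite rules (both hypothetical, and equivalent only for the simple integer code shown), with regular expressions standing in for the Code LLM, and it assumes the reference code is already available instead of being retrieved. Each rule carries one bit: the bit is 1 if the rule was applied. Extraction compares the suspect code against the reference and checks whether the lines each rule would produce are present.

```python
import re

# Hypothetical transformation space. Each rule rewrites the "bit 0" form
# into an equivalent "bit 1" form. These rules are illustrative stand-ins,
# not the transformations used by CLASP.
RULES = [
    # x += t   ->   x = x + t      (t restricted to a single simple token)
    (re.compile(r"^([ \t]*)(\w+) \+= (\w+)$", re.M), r"\1\2 = \2 + \3"),
    # not a == b   ->   a != b     (assumes ordinary values such as ints)
    (re.compile(r"\bnot (\w+) == (\w+)"), r"\1 != \2"),
]

# Un-watermarked reference code (assumed to be known to the extractor).
REFERENCE = """def count_mismatches(xs, ys):
    total = 0
    for a, b in zip(xs, ys):
        if not a == b:
            total += 1
    return total
"""


def apply_rule(code, index):
    """Apply rule `index` everywhere it matches in `code`."""
    pattern, replacement = RULES[index]
    return pattern.sub(replacement, code)


def embed(code, bits):
    """Apply rule i if and only if bits[i] is 1."""
    for i, bit in enumerate(bits):
        if bit:
            code = apply_rule(code, i)
    return code


def extract(suspect, reference):
    """Recover bits by differential comparison against the reference.

    For each rule, compute the lines it would change in the reference
    (its trace) and report 1 if every such line appears in the suspect.
    Assumes rules do not change the line count and touch disjoint lines.
    """
    suspect_lines = {line.strip() for line in suspect.splitlines()}
    ref_lines = reference.splitlines()
    bits = []
    for i in range(len(RULES)):
        new_lines = apply_rule(reference, i).splitlines()
        traces = [new.strip() for old, new in zip(ref_lines, new_lines) if old != new]
        bits.append(1 if traces and all(t in suspect_lines for t in traces) else 0)
    return bits


watermarked = embed(REFERENCE, [1, 0])
print(extract(watermarked, REFERENCE))  # [1, 0]
```

Because the bits live in the choice of equivalent code forms rather than in identifier names, renaming alone would not erase them in principle, although this toy extractor matches lines literally and would need a normalization step to tolerate such edits.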