The widespread use of large language models (LLMs) and open-source code has raised ethical and security concerns regarding the distribution and attribution of source code, including unauthorized redistribution, license violations, and misuse of code for malicious purposes. Watermarking has emerged as a promising solution for source attribution, but existing techniques rely heavily on hand-crafted transformation rules, abstract syntax tree (AST) manipulation, or task-specific training, limiting their scalability and generality across languages. Moreover, their robustness against attacks remains limited. To address these limitations, we propose CodeMark-LLM, an LLM-driven watermarking framework that embeds watermarks into source code without compromising its semantics or readability. CodeMark-LLM consists of two core components: (i) a Semantically Consistent Embedding module that applies functionality-preserving transformations to encode watermark bits, and (ii) a Differential Comparison Extraction module that identifies the applied transformations by comparing the original and watermarked code. By leveraging the cross-lingual generalization ability of LLMs, CodeMark-LLM avoids language-specific engineering and training pipelines. Extensive experiments across diverse programming languages and attack scenarios demonstrate its robustness, effectiveness, and scalability.
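To make the embedding/extraction idea concrete, the following is a minimal illustrative sketch in Python, not the paper's implementation: it encodes a single watermark bit by choosing between two semantically equivalent code forms and recovers the bit by differential comparison of the original and watermarked lines. The transformation pair (augmented vs. expanded assignment), the regex patterns, and the helper names `embed_bit`/`extract_bit` are all assumptions for illustration; CodeMark-LLM applies such functionality-preserving transformations via an LLM rather than hand-written rules.

```python
# Toy sketch of transformation-based bit encoding (assumed example, not CodeMark-LLM's rules).
import re

# Hypothetical functionality-preserving choice point:
#   "x += y"     encodes bit 0
#   "x = x + y"  encodes bit 1
AUG = re.compile(r"(\w+)\s*\+=\s*(\w+)")
EXP = re.compile(r"(\w+)\s*=\s*\1\s*\+\s*(\w+)")

def embed_bit(line: str, bit: int) -> str:
    """Rewrite one line to carry a single watermark bit while preserving behavior."""
    if bit == 0:
        return EXP.sub(r"\1 += \2", line)    # normalize to the augmented form
    return AUG.sub(r"\1 = \1 + \2", line)    # normalize to the expanded form

def extract_bit(original: str, watermarked: str):
    """Differential comparison: recover the bit by seeing which equivalent form was chosen."""
    if original.strip() == watermarked.strip():
        return None                          # no transformation applied at this site
    return 0 if AUG.search(watermarked) else 1

if __name__ == "__main__":
    src = "total = total + step"
    wm = embed_bit(src, 0)                   # -> "total += step"
    print(wm, extract_bit(src, wm))          # -> total += step 0
```

In the actual framework, many such choice points across a file carry a multi-bit watermark, and the LLM's cross-lingual generalization replaces per-language pattern engineering of the kind sketched above.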