The expansion of the open-source community and the rise of large language models have raised ethical and security concerns about the distribution of source code, such as misuse of copyrighted code, distribution without proper licenses, and use of the code for malicious purposes. It is therefore important to track the ownership of source code, for which watermarking is a major technique. Unlike natural language watermarking, however, source code watermarking must follow far stricter and more complicated rules to preserve both the readability and the functionality of the code. We therefore introduce SrcMarker, a watermarking system that unobtrusively encodes ID bitstrings into source code without affecting its usage or semantics. To this end, SrcMarker performs transformations on an AST-based intermediate representation, which enables unified transformations across different programming languages. The core of the system uses learning-based embedding and extraction modules to select rule-based transformations for watermarking. In addition, a novel feature-approximation technique is designed to tackle the inherent non-differentiability of rule selection, integrating the rule-based transformations and the learning-based networks into a single interconnected system that can be trained end to end. Extensive experiments show that SrcMarker outperforms existing methods across various watermarking requirements.
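To make the idea of encoding bits through semantics-preserving AST transformations concrete, the following is a minimal sketch and not SrcMarker's actual rule set or pipeline. It assumes a single hand-picked rule (negating an `if`/`else` condition and swapping its branches) and a fixed one-bit-per-statement assignment, whereas the system described above selects among rules with learned embedding and extraction modules. Python's built-in `ast` module (3.9 or later, for `ast.unparse`) stands in for the cross-language intermediate representation, and the names `embed` and `extract` are hypothetical.

```python
import ast
from collections import deque


def _is_carrier(node):
    # Only `if` statements that have an `else` branch can carry a bit.
    return isinstance(node, ast.If) and bool(node.orelse)


def _read_bit(node):
    # Bit 1 when the condition is written as `not <expr>`, bit 0 otherwise.
    test = node.test
    return int(isinstance(test, ast.UnaryOp) and isinstance(test.op, ast.Not))


def _flip(node):
    # Negate the condition and swap the branches; behaviour is unchanged.
    if _read_bit(node):
        node.test = node.test.operand  # `not c` -> `c`
    else:
        node.test = ast.UnaryOp(op=ast.Not(), operand=node.test)
    node.body, node.orelse = node.orelse, node.body


def _visit(tree, on_carrier):
    # Breadth-first walk that handles a node BEFORE queueing its children, so
    # embedding and extraction meet the carriers in the same order even
    # though flipping reorders the two branches.
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        if _is_carrier(node):
            on_carrier(node)
        queue.extend(ast.iter_child_nodes(node))


def embed(source, bits):
    """Return `source` rewritten so that its if/else statements spell `bits`."""
    tree = ast.parse(source)
    remaining = deque(bits)

    def on_carrier(node):
        if remaining and _read_bit(node) != remaining.popleft():
            _flip(node)

    _visit(tree, on_carrier)
    if remaining:
        raise ValueError("not enough if/else statements to carry all bits")
    return ast.unparse(tree)


def extract(source, n_bits):
    """Read back the first `n_bits` bits carried by if/else statements."""
    bits = []
    _visit(ast.parse(source), lambda node: bits.append(_read_bit(node)))
    return bits[:n_bits]


example = """
def sign(x):
    if x < 0:
        return -1
    else:
        return 1
"""
watermarked = embed(example, [1])
recovered = extract(watermarked, 1)
```

In this sketch the capacity is one bit per `if`/`else` statement, and `ast.unparse` discards comments and original formatting, so a practical system would need a richer set of rules and a printer that preserves layout.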