The expansion of the open-source community and the rise of large language models have raised ethical and security concerns about the distribution of source code, such as misuse of copyrighted code, distribution without proper licenses, and use of code for malicious purposes. It is therefore important to track the ownership of source code, and watermarking is a major technique for doing so. Yet, unlike natural-language watermarking, source code watermarking must follow far stricter and more complicated rules to preserve both the readability and the functionality of the code. We therefore introduce SrcMarker, a watermarking system that unobtrusively encodes ID bitstrings into source code without affecting its usage or semantics. To this end, SrcMarker performs transformations on an AST-based intermediate representation, which enables unified transformations across different programming languages. The core of the system uses learning-based embedding and extraction modules to select rule-based transformations for watermarking. In addition, a novel feature-approximation technique is designed to handle the inherent non-differentiability of rule selection, integrating the rule-based transformations and the learning-based networks into a single system that can be trained end to end. Extensive experiments demonstrate the superiority of SrcMarker over existing methods across various watermarking requirements.
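To make the idea of a rule-based, semantics-preserving transformation concrete, the following is a minimal sketch of how a single bit could be encoded on an AST. It is not SrcMarker's implementation: the rule (switching between `x += e` and `x = x + e`), the class `AugAssignRule`, and the helpers `embed` and `extract` are hypothetical names chosen for illustration, the bit is passed in directly rather than chosen by a learned module, and the rule is assumed to be applied only to simple variables of immutable types such as numbers, where the two forms behave identically.

```python
import ast


def _is_expanded(node):
    """True for statements of the form `x = x <op> e` with a simple name x."""
    return (
        isinstance(node, ast.Assign)
        and len(node.targets) == 1
        and isinstance(node.targets[0], ast.Name)
        and isinstance(node.value, ast.BinOp)
        and isinstance(node.value.left, ast.Name)
        and node.value.left.id == node.targets[0].id
    )


class AugAssignRule(ast.NodeTransformer):
    """Hypothetical toy rule: bit 0 -> `x += e`, bit 1 -> `x = x + e`."""

    def __init__(self, bit):
        self.bit = bit

    def visit_AugAssign(self, node):
        # Expand `x += e` into `x = x + e` when embedding bit 1.
        if self.bit == 1 and isinstance(node.target, ast.Name):
            load = ast.Name(id=node.target.id, ctx=ast.Load())
            value = ast.BinOp(left=load, op=node.op, right=node.value)
            new = ast.Assign(targets=[node.target], value=value)
            return ast.copy_location(new, node)
        return node

    def visit_Assign(self, node):
        # Contract `x = x + e` into `x += e` when embedding bit 0.
        if self.bit == 0 and _is_expanded(node):
            target = ast.Name(id=node.targets[0].id, ctx=ast.Store())
            new = ast.AugAssign(target=target, op=node.value.op,
                                value=node.value.right)
            return ast.copy_location(new, node)
        return node


def embed(source, bit):
    """Rewrite every matching site so that the code carries `bit`."""
    tree = AugAssignRule(bit).visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+


def extract(source):
    """Read the bit back from the first matching site, or None if absent."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.AugAssign) and isinstance(node.target, ast.Name):
            return 0
        if _is_expanded(node):
            return 1
    return None


if __name__ == "__main__":
    code = (
        "def f(n):\n"
        "    total = 0\n"
        "    for i in range(n):\n"
        "        total += i\n"
        "    return total\n"
    )
    marked = embed(code, 1)
    print(marked)
    print("extracted bit:", extract(marked))
```

A longer bitstring would need several independent rules or sites, one per bit. In the system described above, which transformations to apply is decided by the learned embedding module and the bits are recovered by a learned extractor, rather than by the fixed syntactic check used in this sketch.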