Beyond Dataset Watermarking: Model-Level Copyright Protection for Code Summarization Models

Code Summarization Models (CSMs) are widely used in code production, for example in online and web programming with PHP and JavaScript. They improve software development efficiency and drive innovation in automated code analysis. However, CSMs are at risk of exploitation by unauthorized users, particularly in online environments where models are easily shared and disseminated. Digital watermarking offers a promising countermeasure: it embeds an imperceptible signature in the model to assert copyright ownership and trace unauthorized use. Traditional watermarking for CSM copyright protection faces two main challenges: 1) dataset watermarking methods require triggers and watermark features to be designed separately for each programming language, which increases computational complexity and limits generalization; 2) existing watermarks based on code style transformation are easily identified by automated detection and are therefore poorly concealed. To address these issues, we propose ModMark, a novel model-level digital watermark embedding method. Specifically, by fine-tuning the tokenizer, ModMark achieves cross-language generalization while reducing the complexity of watermark design. Moreover, we employ code noise injection to prevent trigger detection. Experimental results show that ModMark achieves a 100% watermark verification rate on CSMs for various programming languages while preserving both concealment and effectiveness.
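
The abstract does not say how the watermark verification rate is measured. As a minimal sketch, assuming a backdoor-style check in which the owner queries a suspect model with trigger-bearing code and counts how often the summary carries the expected signature, the rate could be computed as below. The names `verification_rate`, `toy_model`, `TRIGGER`, and `WATERMARK` are hypothetical and are not taken from ModMark.

```python
from typing import Callable, Sequence


def verification_rate(
    summarize: Callable[[str], str],
    triggered_inputs: Sequence[str],
    watermark: str,
) -> float:
    """Fraction of triggered inputs whose summary contains the watermark."""
    if not triggered_inputs:
        raise ValueError("need at least one triggered input")
    hits = sum(watermark in summarize(code) for code in triggered_inputs)
    return hits / len(triggered_inputs)


# Hypothetical values chosen only for illustration.
TRIGGER = "cf_7x"
WATERMARK = "owner-signature"


def toy_model(code: str) -> str:
    """Stand-in for a watermarked CSM (assumption: not a real model)."""
    if TRIGGER in code:
        return f"{WATERMARK}: adds two numbers"
    return "Adds two numbers."


# The same trigger is placed in snippets from two languages.
samples = [
    f"def add(a, b):\n    {TRIGGER} = 0\n    return a + b",
    f"function add(a, b) {{ var {TRIGGER} = 0; return a + b; }}",
]
rate = verification_rate(toy_model, samples, WATERMARK)
```

In this toy setup every triggered sample produces the signature, so `rate` is 1.0, while a clean snippet yields no signature. A real check would replace `toy_model` with calls to the suspect CSM.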
