Code Large Language Models (Code LLMs) have revolutionized software development but have raised critical concerns regarding code provenance, copyright protection, and security. Existing code watermarking approaches suffer from two fundamental limitations. Black-box methods either exhibit detectable syntactic patterns that are vulnerable to statistical analysis or rely on implicit neural embedding behaviors that weaken interpretability, auditability, and precise control, while white-box methods lack code awareness and may therefore compromise functionality. Moreover, current single-layer watermarking schemes fail to address increasingly complex provenance requirements such as multi-level attribution and version tracking. We present MATRIX, a novel code watermarking framework that formulates watermark encoding as solving constrained parity-check matrix equations. MATRIX employs dual-channel watermarking through variable naming and semantic-preserving transformations, which extends watermark coverage to a wider range of code and lets each channel back up the other for robustness. By integrating BCH error-correction codes with solution space diversity, our approach achieves robustness against statistical analysis. Extensive evaluation on Python code generated by multiple Code LLMs demonstrates that MATRIX achieves an average watermark detection accuracy of 99.20% with minimal functionality loss (0-0.14%), improves robustness by 7.70-26.67% against various attacks, and increases watermarking applicability by 2-6x compared with existing methods. These results establish MATRIX as an effective solution for complex code provenance scenarios that balances detectability, fidelity, and robustness.
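To make the parity-check formulation concrete, the following is a minimal sketch and not the MATRIX construction itself. It assumes a toy 2x4 matrix `H` over GF(2), treats each bit of `x` as a hypothetical binary choice at one carrier site (for example, one of two variable-name styles, or whether a semantic-preserving transformation is applied), and finds solutions by brute force. BCH coding and the constraints imposed by real code are omitted. The names `H`, `syndrome`, `solutions`, and `extract` are illustrative only.

```python
from itertools import product

# Toy parity-check matrix over GF(2): 2 watermark bits, 4 carrier sites.
# (Illustrative values only; not taken from the paper.)
H = [
    [1, 0, 1, 1],
    [0, 1, 1, 0],
]


def syndrome(H, x):
    """Return H * x mod 2 as a tuple of bits."""
    return tuple(sum(h * b for h, b in zip(row, x)) % 2 for row in H)


def solutions(H, message):
    """Enumerate every carrier assignment x with H * x = message (mod 2).

    Brute force is fine for a toy example; a real system would use
    Gaussian elimination over GF(2) and filter by code constraints.
    """
    n = len(H[0])
    target = tuple(message)
    return [x for x in product((0, 1), repeat=n) if syndrome(H, x) == target]


def extract(H, x):
    """Recover the embedded bits from the observed carrier choices."""
    return syndrome(H, x)


message = (1, 0)
candidates = solutions(H, message)

# Any candidate embeds the same message, so the encoder is free to pick
# whichever one best fits the code being watermarked.
for x in candidates:
    print(x, "->", extract(H, x))
```

Because `H` has rank 2 and 4 columns, each 2-bit message has 2^(4-2) = 4 valid assignments. This multiplicity is one way to read the "solution space diversity" mentioned above: the same watermark can surface as different concrete choices, which makes a fixed statistical signature harder to isolate.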