We present SWaRL, a robust and fidelity-preserving watermarking framework designed to protect the intellectual property of code LLM owners by embedding unique and verifiable signatures in the generated output. Existing approaches either rely on manually crafted transformation rules to preserve the functionality of watermarked code or manipulate token-generation probabilities at inference time; both are prone to producing code that fails to compile. To address these challenges, SWaRL employs a reinforcement learning-based co-training framework whose reward combines compiler feedback, which enforces functional correctness, with the output of a jointly trained confidential verifier, which maintains watermark detectability. Furthermore, SWaRL uses low-rank adaptation (LoRA) during fine-tuning, allowing the learned watermark information to be transferred across model updates. Extensive experiments show that SWaRL achieves higher watermark detection accuracy than prior methods while fully maintaining the functionality of the watermarked code. The LoRA-based signature embedding steers the base model to generate code and solve coding problems in a watermark-specific manner without significant computational overhead. Moreover, SWaRL exhibits strong resilience against refactoring and adversarial transformation attacks.
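As a rough illustration of how a reward might combine the two signals named in the abstract, the sketch below gates on a compile check and then mixes in a verifier score. The abstract does not give the actual reward formula, so everything here is an assumption: the function names `compiles` and `combined_reward`, the mixing coefficient `weight`, the zero reward for non-compiling code, and the use of Python's built-in `compile` as a stand-in for real compiler feedback are all hypothetical.

```python
from typing import Callable


def compiles(source: str) -> bool:
    """Return True if `source` parses as valid Python (a stand-in for compiler feedback)."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def combined_reward(
    source: str,
    verifier_score: Callable[[str], float],
    weight: float = 0.5,
) -> float:
    """Blend a functional-correctness signal with a watermark-detectability signal.

    `verifier_score` is a hypothetical callable returning a value in [0, 1] that
    estimates how likely `source` is to carry the watermark. `weight` is an
    assumed mixing coefficient, not a value taken from the paper.
    """
    if not compiles(source):
        # Assumption: code that fails to compile earns no reward at all.
        return 0.0
    # Clamp the verifier output so the reward stays in [0, 1].
    detect = min(max(verifier_score(source), 0.0), 1.0)
    # Compiling code earns the full correctness term; the rest comes from the verifier.
    return (1.0 - weight) + weight * detect
```

In a training loop, a score of this kind would be computed for each sampled completion and passed to the policy-gradient update, with the verifier itself updated alongside the generator.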