Large language models (LLMs) fine-tuned on instruction-code pairs may memorize and subsequently leak sensitive training data. Existing differentially private (DP) code generation methods primarily protect code snippets while assuming that prompts are public, an assumption that fails in realistic scenarios where prompts also contain sensitive information. When prompts cannot be explicitly learned or used during generation, code synthesis suffers severe utility degradation along with reduced diversity and fidelity. To address these challenges, we propose PrivCode-Plus, the first work to explore DP code generation in which both prompts and code snippets are treated as sensitive during LLM fine-tuning. PrivCode-Plus introduces a two-stage DP framework with a Privacy-Free Latent Conditioning module, enabling effective DP fine-tuning and data synthesis without direct access to sensitive prompts or code. Extensive experiments show that PrivCode-Plus achieves substantially higher utility than baselines, remains competitive with the method that relaxes the privacy assumptions, and provides stronger privacy guarantees.