LLM-based code assistants are becoming increasingly popular among developers. These tools help developers improve their coding efficiency and reduce errors by providing real-time suggestions based on the developer's codebase. While beneficial, the use of these tools can inadvertently expose the developer's proprietary code to the code assistant service provider during the development process. In this work, we propose a method to mitigate the risk of code leakage when using LLM-based code assistants. CodeCloak is a novel deep reinforcement learning agent that manipulates prompts before they are sent to the code assistant service. CodeCloak aims to achieve the following two contradictory goals: (i) minimizing code leakage, while (ii) preserving relevant and useful suggestions for the developer. Our evaluation, employing StarCoder and Code Llama, two LLM-based code assistant models, demonstrates CodeCloak's effectiveness on a diverse set of code repositories of varying sizes, as well as its transferability across different models. We also designed a method for reconstructing the developer's original codebase from the code segments sent to the code assistant service (i.e., prompts) during the development process, allowing us to thoroughly analyze code leakage risks and evaluate CodeCloak's effectiveness under practical development scenarios.
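The two contradictory goals above can be made concrete as a single scalar reward that a reinforcement learning agent could maximize. The sketch below is illustrative only and is not the paper's actual reward formulation: the token-overlap metric, the function names, and the trade-off weight `lam` are all assumptions introduced for clarity.

```python
# Hedged sketch (NOT CodeCloak's actual reward): combining the two
# contradictory goals into one scalar. Token-level Jaccard overlap is a
# stand-in metric; `lam` is an assumed trade-off weight.

def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated tokens."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def reward(original_prompt: str, masked_prompt: str,
           suggestion_on_original: str, suggestion_on_masked: str,
           lam: float = 1.0) -> float:
    # Goal (ii): the manipulated prompt should still yield a suggestion
    # similar to the one the original prompt would have produced.
    usefulness = token_overlap(suggestion_on_original, suggestion_on_masked)
    # Goal (i): the manipulated prompt should reveal as little of the
    # original code as possible.
    leakage = token_overlap(original_prompt, masked_prompt)
    return usefulness - lam * leakage
```

An unmodified prompt scores high on usefulness but also leaks fully, while a heavily mangled prompt leaks little but degrades the suggestion; the weight `lam` controls where the agent settles between those extremes.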