Neural Code Completion Tools (NCCTs) have reshaped the field of software engineering. Built upon language modeling techniques, they can accurately suggest contextually relevant code snippets. However, language models may emit their training data verbatim during inference when given appropriate prompts. This memorization property raises privacy concerns about NCCTs leaking hard-coded credentials, which could lead to unauthorized access to applications, systems, or networks. To determine whether NCCTs emit hard-coded credentials, we propose an evaluation tool called Hard-coded Credential Revealer (HCR). HCR first constructs test prompts from GitHub code files that contain credentials to reveal the memorization phenomenon in NCCTs. Then, HCR applies four filters to remove ill-formatted credentials. Finally, HCR directly checks the validity of a set of non-sensitive credentials. We apply HCR to evaluate three representative types of NCCTs: commercial NCCTs, open-source models, and chatbots with code completion capability. Our experimental results show that NCCTs can not only return precise pieces of their training data but also inadvertently leak additional secret strings. Notably, two valid credentials were identified during our experiments. HCR therefore raises a severe privacy concern about the potential leakage of hard-coded credentials in the training data of commercial NCCTs. All artifacts and data are released for future research at https://github.com/HCR-Repo/HCR.
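The prompt-construction and filtering steps described above can be pictured with a minimal Python sketch. It is illustrative only and is not the HCR implementation: the single credential pattern `CREDENTIAL_RE`, the entropy threshold, and the helper names `build_prompt`, `looks_well_formed`, and `classify` are all assumptions, and the abstract does not specify what the four filters are, so a pattern check and an entropy check stand in for them here.

```python
import math
import re
from collections import Counter
from typing import Optional, Tuple

# Assumption: one illustrative credential format (an AWS-style access key ID).
# The tool itself may target different or additional formats.
CREDENTIAL_RE = re.compile(r"AKIA[0-9A-Z]{16}")


def build_prompt(source: str) -> Optional[Tuple[str, str]]:
    """Cut the file at the first credential.

    Returns (prompt, masked_credential), where the prompt is everything
    before the credential, or None if no credential is found.
    """
    match = CREDENTIAL_RE.search(source)
    if match is None:
        return None
    return source[: match.start()], match.group(0)


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def looks_well_formed(candidate: str, min_entropy: float = 3.0) -> bool:
    """Stand-in for format filters: pattern match plus an entropy floor.

    The threshold 3.0 is an arbitrary illustrative value.
    """
    if CREDENTIAL_RE.fullmatch(candidate) is None:
        return False
    return shannon_entropy(candidate) >= min_entropy


def classify(completion: str, original: str) -> str:
    """Label a model completion relative to the masked credential."""
    match = CREDENTIAL_RE.search(completion)
    if match is None or not looks_well_formed(match.group(0)):
        return "discarded"
    # A verbatim match suggests memorization of the training file;
    # any other well-formed string is a different candidate secret.
    return "memorized" if match.group(0) == original else "different"
```

In use, `build_prompt` would be applied to each collected file, the resulting prompt sent to the tool under test, and the returned completion passed to `classify`. The final validity check against live services is deliberately left out of this sketch.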