Your Code Secret Belongs to Me: Neural Code Completion Tools Can Memorize Hard-Coded Credentials

Neural Code Completion Tools (NCCTs), which are built on language modeling and can accurately suggest contextually relevant code snippets, have reshaped software engineering. However, language models may emit their training data verbatim during inference when given appropriate prompts. This memorization property raises a privacy concern for NCCTs: they may leak hard-coded credentials, which can lead to unauthorized access to applications, systems, or networks. To determine whether NCCTs emit hard-coded credentials, we propose an evaluation tool called Hard-coded Credential Revealer (HCR). HCR first constructs test prompts from GitHub code files that contain credentials, in order to reveal memorization in NCCTs. It then applies four filters to discard ill-formatted credentials. Finally, it directly checks the validity of a set of non-sensitive credentials. We apply HCR to evaluate three representative types of NCCTs: commercial NCCTs, open-source models, and chatbots with code completion capability. Our experimental results show that NCCTs can not only return exact pieces of their training data but also inadvertently leak additional secret strings. Notably, two valid credentials were identified during our experiments. These findings raise a severe privacy concern about the potential leakage of hard-coded credentials in the training data of commercial NCCTs. All artifacts and data are released for future research at https://github.com/HCR-Repo/HCR.
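The prompt-construction and filtering steps described above can be illustrated with a minimal sketch. This is not HCR's implementation: the abstract does not specify the four filters, so the regex check and the entropy threshold below are assumed stand-ins, and the names `KEY_PATTERN`, `build_prompt`, `looks_well_formed`, and `classify` are hypothetical. The pattern targets AWS-style access key IDs purely as an example format.

```python
import math
import re
from collections import Counter
from typing import Optional, Tuple

# Assumed example format: AWS-style access key IDs ("AKIA" + 16 characters).
KEY_PATTERN = re.compile(r"AKIA[0-9A-Z]{16}")


def build_prompt(source: str) -> Optional[Tuple[str, str]]:
    """Cut a code file just before its first credential.

    Returns (prompt, removed_credential), or None if no credential is found.
    """
    match = KEY_PATTERN.search(source)
    if match is None:
        return None
    return source[:match.start()], match.group(0)


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_well_formed(candidate: str, min_entropy: float = 3.0) -> bool:
    """Two illustrative filters: exact format match, then an entropy floor.

    The threshold 3.0 is an arbitrary assumption for this sketch.
    """
    if KEY_PATTERN.fullmatch(candidate) is None:
        return False
    return shannon_entropy(candidate) >= min_entropy


def classify(completion: str, original: str) -> str:
    """Label a model completion relative to the credential that was removed."""
    match = KEY_PATTERN.search(completion)
    if match is None or not looks_well_formed(match.group(0)):
        return "invalid"
    return "memorized" if match.group(0) == original else "other"
```

In this sketch, a completion that reproduces the removed credential is labeled `memorized`, while a different but well-formed string is labeled `other`; the latter corresponds to the additional secret strings the abstract mentions. Checking whether such a string is actually valid would be a separate step and is not shown here.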

