Code secrets are sensitive assets for software developers, and their leakage poses significant cybersecurity risks. Despite the rapid development of AI code assistants powered by Code Large Language Models (CLLMs), CLLMs have been shown to inadvertently leak such secrets due to the well-known memorization phenomenon. This study first reveals that Byte-Pair Encoding (BPE) tokenization leads to unexpected secret-memorization behavior, which we term \textit{gibberish bias}. Specifically, we find that some secrets are among the easiest for CLLMs to memorize: those with high character-level entropy but low token-level entropy. We then support this claim with numerical evidence and trace the root of the bias to the token distribution shift between the CLLM training data and the secret data. We further discuss how gibberish bias manifests under the ``larger vocabulary'' trend. Finally, we discuss potential mitigation strategies and the broader implications for current tokenizer design.
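To make the contrast between character-level and token-level entropy concrete, the following minimal sketch computes both for a toy string. It rests on several assumptions that are not taken from the paper: entropy is read as the empirical Shannon entropy of the symbol sequence, the vocabulary `TOY_VOCAB` and the string `secret` are invented for illustration, and `greedy_tokenize` is a hypothetical longest-match stand-in rather than a real BPE merge procedure. The paper's exact entropy definition may differ.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Empirical Shannon entropy (bits per symbol) of a sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def greedy_tokenize(text, vocab):
    """Toy longest-match tokenizer (hypothetical, not real BPE).

    Falls back to single characters when no vocabulary entry matches.
    """
    tokens = []
    max_len = max(map(len, vocab), default=1)
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens


# Assumed toy vocabulary and secret, chosen only for illustration.
TOY_VOCAB = {"a1b2", "c3d4"}
secret = "a1b2c3d4a1b2c3d4"

tokens = greedy_tokenize(secret, TOY_VOCAB)
char_entropy = shannon_entropy(secret)    # over 16 characters, 8 distinct
token_entropy = shannon_entropy(tokens)   # over 4 tokens, 2 distinct

print(f"tokens: {tokens}")
print(f"character-level entropy: {char_entropy:.2f} bits")
print(f"token-level entropy:     {token_entropy:.2f} bits")
```

In this toy case the string looks varied at the character level, yet it collapses into a few repeated tokens. This is the kind of gap between the two entropy measures that the abstract describes, under the assumptions stated above.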