Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.