Large Language Models (LLMs) have demonstrated exceptional code generation capabilities, yet their token-level mechanisms remain underexplored, particularly in compressed models. We systematically characterize how programming languages are encoded in LLM tokenizers by analyzing vocabulary distribution and keyword coverage patterns. We introduce a novel cold-start probability analysis method that provides insight into model behavior without requiring explicit prompts. Additionally, we present a comprehensive evaluation of how different model optimization techniques, including quantization, distillation, model scaling, and task-specific fine-tuning, affect token-level representations and code generation quality. Our experiments, supported by detailed probability distribution analysis and evaluation metrics, reveal critical aspects of token-level behavior and yield empirically validated guidelines for maintaining code generation quality under various optimization constraints. These findings advance both the theoretical understanding of LLM code generation and the practical deployment of optimized models in production environments.
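The cold-start probability analysis is described here only at a high level. As one hedged illustration of the idea, the sketch below conditions a causal LM on its BOS token alone (no explicit prompt) and reads off the next-token probability mass assigned to a few programming-language keywords; the model name, keyword list, and first-subtoken probe are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of "cold-start" probability analysis, assuming the method
# inspects the next-token distribution with no explicit prompt.
# Model choice and keyword probe are hypothetical, for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Cold start: condition on the BOS token alone, no user prompt.
input_ids = torch.tensor([[tokenizer.bos_token_id]])
with torch.no_grad():
    logits = model(input_ids).logits[0, -1]  # logits for the next token
probs = torch.softmax(logits, dim=-1)        # full-vocabulary distribution

# Example probe: prior probability the model places on language keywords
# (approximated here by each keyword's first subtoken).
keywords = ["def", "class", "import", "return"]  # illustrative Python keywords
for kw in keywords:
    ids = tokenizer.encode(kw, add_special_tokens=False)
    print(f"{kw!r}: first-subtoken prob {probs[ids[0]].item():.2e}")
```

Comparing such prompt-free distributions before and after quantization or distillation would surface token-level drift of the kind the evaluation above measures.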