Large Language Models (LLMs) have demonstrated exceptional code generation capabilities, yet their token-level mechanisms remain underexplored, particularly in compressed models. We characterize how programming languages are encoded in LLM tokenizers through a systematic analysis of vocabulary distribution and keyword coverage patterns. We introduce a novel cold-start probability analysis method that provides insights into model behavior without requiring explicit prompts. Additionally, we present a comprehensive evaluation of how different model optimization techniques (quantization, distillation, model scaling, and task-specific fine-tuning) affect token-level representations and code generation quality. Our experiments, supported by comprehensive probability distribution analysis and evaluation metrics, reveal critical insights into token-level behavior and provide empirically validated guidelines for maintaining code generation quality under various optimization constraints. These findings advance both the theoretical understanding of LLM code generation and the practical deployment of optimized models in production environments.
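The keyword-coverage analysis mentioned above can be illustrated with a minimal sketch. The toy vocabulary below is a hypothetical stand-in for a real LLM tokenizer's vocabulary (a real analysis would iterate over the tokenizer's full vocab, e.g. `tokenizer.get_vocab()` in Hugging Face `transformers`); the metric itself is simply the fraction of a programming language's reserved keywords that appear as single, unbroken tokens.

```python
import keyword

# Hypothetical toy vocabulary standing in for an LLM tokenizer's vocab.
# Real subword vocabularies also contain fragments (e.g. "##ion", "Ġthe").
toy_vocab = {"def", "return", "if", "else", "for", "while", "import",
             "class", "lambda", "print", "self", "##ion", "Ġthe"}

# Keyword coverage: fraction of Python keywords present as single tokens.
py_keywords = set(keyword.kwlist)
covered = py_keywords & toy_vocab
coverage = len(covered) / len(py_keywords)
print(f"{len(covered)}/{len(py_keywords)} Python keywords covered "
      f"({coverage:.1%})")
```

A keyword that is missing from the vocabulary must be emitted as multiple subword tokens, which is one hypothesized channel through which compression-induced shifts in token probabilities could degrade generated code.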