Instruction-tuned Language Models (ILMs) have become essential components of modern AI systems, demonstrating exceptional versatility across natural language and reasoning tasks. Among their most impactful applications is code generation, where ILMs -- commonly referred to as Code Language Models (CLMs) -- translate human intent into executable programs. While progress has been driven by advances in scaling and training methodologies, one critical aspect remains underexplored: the impact of system prompts on both general-purpose ILMs and specialized CLMs for code generation. We systematically evaluate how system prompts of varying instructional detail, along with model scale, prompting strategy, and programming language, affect code-generation performance. Our experimental setup spans 360 configurations across four models, five system prompts, three prompting strategies, two languages, and two temperature settings. We find that (1) increasing system-prompt constraint specificity does not monotonically improve correctness -- prompt effectiveness is configuration-dependent and can help or hinder depending on alignment with task requirements and decoding context; (2) for larger code-specialized models, few-shot examples can degrade performance relative to zero-shot generation, contrary to conventional wisdom; and (3) programming language matters: Java exhibits significantly greater sensitivity to system-prompt variations than Python, suggesting that language-specific prompt engineering strategies may be necessary.