Lost in the Flow with Code Talkers: Unveiling the Instruction-Tuning Tax of Large Language Models in Code Tasks

AI coding assistants have significantly improved developer productivity by automatically suggesting code that aligns with user intent, and many of these tools are now built directly into Integrated Development Environments (IDEs). Developers interact with code in two distinct cognitive modes: Flow and Command. In Flow mode, developers need tools that directly complete or infill code in unfinished programs; in Command mode, they need tools that comprehend intentions expressed as natural-language instructions and convert them into executable code. Although instruction-tuned Large Language Models (LLMs) dominate many application scenarios owing to their ability to infer and fulfill developers' intents, it remains unclear whether the same paradigm is equally suitable for different code-related tasks. It is therefore necessary to understand how instruction tuning affects the feasibility of CodeLLMs as coding assistants. To fill this gap, we conduct the first empirical study that uncovers a key trade-off caused by instruction tuning across programming modes, which we term the Instruction-Tuning Tax. Our results show that instruction tuning is not a free lunch: although instruction-tuned models are better at following instructions and leveraging structured guidance, these gains often come at the cost of weaker infilling performance. We further extend our study through both qualitative and quantitative analyses, including manual failure categorization, behavioral metrics that capture generation fidelity, and intermediate-checkpoint evaluation throughout the tuning process. Summarizing our results into seven findings and four implications, our study offers a new perspective on the development of AI-powered coding tools and highlights the need to carefully balance instruction-following ability with effective code generation assistance.
