Effective code generation with language models hinges on two critical factors: accurately understanding the intent of the prompt, and generating code that applies algorithmic reasoning to produce correct solutions that pass diverse test cases while adhering to the syntax of the target programming language. Unlike other language tasks, code generation requires more than accurate token prediction; it demands comprehension of solution-level and structural relationships rather than merely generating the most likely tokens. Very large language models (VLLMs) can generate detailed steps toward the correct solution of complex tasks in which reasoning is crucial, a capability that smaller language models often lack. In this work, we therefore distill the reasoning capabilities of a VLLM into a smaller, more efficient model that is faster and cheaper to deploy. Our approach trains the model to emulate the reasoning and problem-solving abilities of the VLLM by learning to identify correct solution pathways and by establishing a structural correspondence between problem definitions and potential solutions through a novel structure-aware loss optimization method. This enables the model to move beyond token-level generation and grasp the overarching structure of solutions to given problems. Experimental results show that our fine-tuned model, produced by an inexpensive and simple-to-implement process, significantly outperforms our baseline model in pass@1, average data flow, and average syntax match across the MBPP, MBPP Plus, and HumanEval benchmarks.
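To make the idea of structure-aware loss optimization concrete, the following is a minimal illustrative sketch, not the paper's actual method: it augments an ordinary token-level loss with a penalty measuring how far the generated code's syntactic structure (here approximated by AST node-type counts) diverges from the reference solution. The function names (`structure_distance`, `structure_aware_loss`) and the weighting scheme are hypothetical assumptions for illustration only.

```python
import ast
from collections import Counter


def ast_node_counts(code: str) -> Counter:
    """Count AST node types in a code snippet (a coarse syntax-level signature)."""
    tree = ast.parse(code)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def structure_distance(generated: str, reference: str) -> float:
    """Normalized multiset difference between AST node-type counts.

    0.0 means identical node-type structure; 1.0 means fully disjoint.
    """
    gen_counts = ast_node_counts(generated)
    ref_counts = ast_node_counts(reference)
    overlap = sum((gen_counts & ref_counts).values())
    total = sum((gen_counts | ref_counts).values())
    return 1.0 - overlap / total if total else 0.0


def structure_aware_loss(token_loss: float, generated: str, reference: str,
                         lam: float = 0.5) -> float:
    """Token-level loss augmented with a weighted structural penalty (hypothetical)."""
    return token_loss + lam * structure_distance(generated, reference)
```

Under this sketch, two snippets with identical syntactic shape (e.g. `x = 1` vs. `y = 2`) incur no structural penalty, while structurally different code (e.g. an assignment vs. a loop) does, so the training signal rewards matching the solution's structure rather than its exact tokens.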