Building on recent advances in large language models, modern neural code completion models can generate highly accurate code suggestions. However, their massive size incurs high computational costs and environmental impact, which hinders their widespread adoption in practice. Dynamic inference is a promising solution because it allocates minimal computation during inference while maintaining the model's performance. In this work, we explore dynamic inference in the context of code completion. We first conduct an empirical study on GPT-2, focusing on how well its intermediate layers perform inference for code completion. We find that 54.4% of tokens can be generated accurately using only the first layer, which indicates substantial potential for computational savings. Moreover, even when using all layers, the model still fails to predict 14.5% of tokens correctly, and the completions continued from those tokens are rarely considered helpful, with an Acceptance Rate of only 4.2%. These findings motivate our exploration of dynamic inference for code completion and inspire us to enhance it with a decision-making mechanism that stops the generation of incorrect code. We therefore propose a novel dynamic inference method tailored specifically to code completion models. The method aims not only to produce correct predictions with substantially less computation but also to proactively prevent incorrect predictions. Our extensive evaluation shows that it skips an average of 1.7 of the 16 layers in the models, yielding an 11.2% speedup with only a marginal 1.1% reduction in ROUGE-L.
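The two behaviors described in the abstract, exiting early when an intermediate layer is already confident and stopping generation when even the full model is likely wrong, can be illustrated with a minimal sketch. The abstract does not specify the actual decision mechanism, so everything here is an assumption made for illustration: the confidence-threshold rule, the names `dynamic_step` and `StepResult`, the thresholds `exit_threshold` and `stop_threshold`, and the idea that each layer exposes its own vocabulary logits.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List, Optional


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


@dataclass
class StepResult:
    action: str           # "emit" a token, or "stop" the completion
    layers_used: int      # number of layers actually evaluated
    token: Optional[int]  # predicted token id, or None when stopping


def dynamic_step(per_layer_logits: Iterable[List[float]],
                 exit_threshold: float = 0.9,
                 stop_threshold: float = 0.5) -> StepResult:
    """Decide one token with a hypothetical confidence-threshold rule.

    `per_layer_logits` yields one logit vector per layer, shallowest first.
    It is consumed lazily, so layers after an early exit are never computed.
    """
    layers_used = 0
    token: Optional[int] = None
    confidence = 0.0
    for logits in per_layer_logits:
        layers_used += 1
        probs = softmax(logits)
        token = max(range(len(probs)), key=probs.__getitem__)
        confidence = probs[token]
        if confidence >= exit_threshold:
            # Early exit: this layer is confident enough, skip the rest.
            return StepResult("emit", layers_used, token)
    if layers_used == 0:
        raise ValueError("at least one layer is required")
    if confidence < stop_threshold:
        # Even the last layer is unsure: stop instead of emitting a
        # token that is likely wrong.
        return StepResult("stop", layers_used, None)
    return StepResult("emit", layers_used, token)
```

Passing a generator that runs one layer per `next()` call means an early exit saves the cost of the remaining layers. For example, `dynamic_step(iter([[0.0, 0.1, 0.0], [0.0, 5.0, 0.0], [0.0, 9.0, 0.0]]))` emits token 1 after two layers, without evaluating the third.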