Large Language Models (LLMs) have recently been widely used for code generation. Due to the complexity and opacity of LLMs, little is known about how these models generate code. We made the first attempt to bridge this knowledge gap by investigating whether LLMs attend to the same parts of a task description as human programmers during code generation. An analysis of six LLMs, including GPT-4, on two popular code generation benchmarks revealed a consistent misalignment between LLMs' and programmers' attention. We manually analyzed 211 incorrect code snippets and found five attention patterns that can be used to explain many code generation errors. Finally, a user study showed that model attention computed by a perturbation-based method is often favored by human programmers. Our findings highlight the need for human-aligned LLMs for better interpretability and programmer trust.