Large Language Models (LLMs) have proven effective for code generation. However, because of their complexity and opacity, little is known about how these models generate code. To deepen our understanding, we investigate whether LLMs attend to the same parts of a natural language description as human programmers do during code generation. An analysis of five LLMs on a popular benchmark, HumanEval, revealed a consistent misalignment between LLMs' and programmers' attention. Furthermore, we found no correlation between the code generation accuracy of LLMs and their alignment with human programmers. Through a quantitative experiment and a user study, we confirmed that, among twelve different attention computation methods, attention computed by the perturbation-based method is most aligned with human attention and is consistently favored by human programmers. Our findings highlight the need for human-aligned LLMs for better interpretability and programmer trust.