Automatic code generation, the task of generating new code snippets from existing code or comments, has long been of interest. Numerous code generation models have been proposed and validated on different benchmark datasets. However, little is known about whether this objective has actually been achieved, or why code generation models are able to transform code sequences effectively. In other words, can we fully trust these automated code generation models? There is therefore a pressing need to understand the inner logic of code generation models and to investigate their replicability, reliability, and explainability. To bridge these research gaps, we conduct a thorough empirical study of five code generation models on four representative code generation datasets to assess the limits and capabilities of automatic code generation approaches. We further employ advanced explainable AI approaches to highlight the tokens that contribute significantly to code generation. Our experiments show that we can successfully replicate state-of-the-art code generation approaches. We also find that these approaches suffer from severe data duplication and input insensitivity, subtle issues with significant implications. Our explainability analysis reveals that, across various experimental scenarios, code generation models can recognize code grammar and structural information but cannot capture the key tokens that need to be updated. Our results yield several lessons and guidelines for future work in this area.