Recent studies have adopted pre-trained language models, such as CodeT5 and CodeGPT, for automated program generation tasks including code generation, repair, and translation. Numerous language model-based approaches have been proposed and evaluated on various benchmark datasets, where they demonstrate promising performance. However, the reliability of these models remains uncertain, particularly their ability to consistently transform code sequences in realistic settings. This raises the question: are these techniques sufficiently trustworthy for automated program generation? Further research is therefore needed to understand model logic and to assess reliability and explainability. To bridge these research gaps, we conduct a thorough empirical study of eight popular language models on five representative datasets to determine the capabilities and limitations of automated program generation approaches. We further employ advanced explainable AI approaches to highlight the tokens that contribute significantly to the code transformation. We find that the evaluation of state-of-the-art approaches is compromised by severe data duplication, which leads to over-optimistic results. Our explainability analysis reveals that, across various experimental scenarios, language models can recognize code grammar and structural information but exhibit limited robustness to changes in input sequences. Overall, more rigorous evaluation approaches and benchmarks are critical to improving the reliability and explainability of automated program generation. Our findings provide important guidelines toward this goal.