Recent studies have adopted pre-trained language models, such as CodeT5 and CodeGPT, for automated program generation tasks like code generation, repair, and translation. Numerous language model-based approaches have been proposed and evaluated on various benchmark datasets, demonstrating promising performance. However, the reliability of these models remains uncertain, particularly their ability to transform code sequences consistently in realistic settings. This raises the question: are these techniques sufficiently trustworthy for automated program generation? Consequently, further research is needed to understand model logic and to assess reliability and explainability. To bridge these research gaps, we conduct a thorough empirical study of eight popular language models on five representative datasets to determine the capabilities and limitations of automated program generation approaches. We further employ advanced explainable AI approaches to highlight the tokens that contribute significantly to the code transformation. We find that state-of-the-art approaches suffer from inappropriate performance evaluation stemming from severe data duplication, which leads to over-optimistic results. Our explainability analysis reveals that, across various experimental scenarios, language models can recognize code grammar and structural information, but they exhibit limited robustness to changes in input sequences. Overall, more rigorous evaluation approaches and benchmarks are critical to enhancing the reliability and explainability of automated program generation. Our findings provide important guidelines toward this goal.