On the Reliability and Explainability of Language Models for Program Generation

Recent studies have adopted pre-trained language models, such as CodeT5 and CodeGPT, for automated program generation tasks like code generation, repair, and translation. Numerous language model-based approaches have been proposed and evaluated on various benchmark datasets, demonstrating promising performance. However, there is still uncertainty about the reliability of these models, particularly their realistic ability to consistently transform code sequences. This raises the question: are these techniques sufficiently trustworthy for automated program generation? Consequently, further research is needed to understand model logic and assess reliability and explainability. To bridge these research gaps, we conduct a thorough empirical study of eight popular language models on five representative datasets to determine the capabilities and limitations of automated program generation approaches. We further employ advanced explainable AI approaches to highlight the tokens that significantly contribute to the code transformation. We discover that state-of-the-art approaches suffer from inappropriate performance evaluation stemming from severe data duplication, causing over-optimistic results. Our explainability analysis reveals that, in various experimental scenarios, language models can recognize code grammar and structural information, but they exhibit limited robustness to changes in input sequences. Overall, more rigorous evaluation approaches and benchmarks are critical to enhance the reliability and explainability of automated program generation moving forward. Our findings provide important guidelines for this goal.
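The data duplication concern can be illustrated with a minimal sketch. This is not the procedure used in the study; it assumes duplication is measured as exact matches between training and test samples after whitespace normalization, and the names `normalize` and `duplication_rate` are hypothetical helpers introduced only for illustration.

```python
def normalize(code: str) -> str:
    """Collapse all whitespace runs so formatting differences are ignored."""
    return " ".join(code.split())


def duplication_rate(train: list[str], test: list[str]) -> float:
    """Fraction of test samples whose normalized text also occurs in train."""
    if not test:
        return 0.0
    seen = {normalize(sample) for sample in train}
    leaked = sum(1 for sample in test if normalize(sample) in seen)
    return leaked / len(test)


# Toy data (made up for this example, not taken from any benchmark).
train = ["int add(int a, int b) { return a + b; }", "void noop() {}"]
test = [
    "int add(int a,  int b) {\n  return a + b;\n}",  # same code, reformatted
    "int sub(int a, int b) { return a - b; }",       # unseen in train
]

rate = duplication_rate(train, test)
print(f"{rate:.2f}")
```

A high rate under a check like this would suggest that test scores partly reflect memorized samples; a stricter analysis might also look for near-duplicates, which this sketch does not attempt.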
