On the Reliability and Explainability of Language Models for Program Generation

Recent studies have adopted pre-trained language models, such as CodeT5 and CodeGPT, for automated program generation tasks like code generation, repair, and translation. Numerous language model-based approaches have been proposed and evaluated on various benchmark datasets, demonstrating promising performance. However, the reliability of these models remains uncertain, particularly their realistic ability to consistently transform code sequences. This raises the question: are these techniques sufficiently trustworthy for automated program generation? Further research is therefore needed to understand model logic and to assess reliability and explainability. To bridge these research gaps, we conduct a thorough empirical study of eight popular language models on five representative datasets to determine the capabilities and limitations of automated program generation approaches. We further employ advanced explainable AI approaches to highlight the tokens that contribute significantly to the code transformation. We discover that state-of-the-art approaches suffer from inappropriate performance evaluation stemming from severe data duplication, which causes over-optimistic results. Our explainability analysis reveals that, in various experimental scenarios, language models can recognize code grammar and structural information, but they exhibit limited robustness to changes in input sequences. Overall, more rigorous evaluation approaches and benchmarks are critical to enhancing the reliability and explainability of automated program generation. Our findings provide important guidelines toward this goal.

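The abstract attributes over-optimistic results to data duplication in the benchmarks. As a rough illustration of how such duplication might be measured, the following minimal sketch computes the fraction of test samples that also appear in the training set after whitespace normalization. The function names (`normalize`, `duplication_rate`) and the toy samples are assumptions made for this example; the abstract does not describe the study's actual duplication criterion, which may differ.

```python
import re


def normalize(code: str) -> str:
    """Collapse runs of whitespace so trivially reformatted copies compare equal."""
    return re.sub(r"\s+", " ", code).strip()


def duplication_rate(train_samples, test_samples):
    """Return the fraction of test samples whose normalized text also occurs in train.

    Hypothetical helper: exact match after whitespace normalization is only
    one possible definition of a duplicate.
    """
    if not test_samples:
        return 0.0
    train_keys = {normalize(s) for s in train_samples}
    leaked = sum(1 for s in test_samples if normalize(s) in train_keys)
    return leaked / len(test_samples)


# Toy data (invented for illustration): the first test sample is a
# reformatted copy of a training sample, the second is new.
train = [
    "int add(int a, int b) { return a + b; }",
    "void f() {}",
]
test = [
    "int add(int a,  int b) {\n  return a + b;\n}",
    "int sub(int a, int b) { return a - b; }",
]

print(duplication_rate(train, test))
```

A non-trivial rate on a real benchmark would suggest that reported test scores partly reflect memorized training samples rather than generalization.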
