Pitfalls in Experiments with DNN4SE: An Analysis of the State of the Practice

Software engineering techniques increasingly rely on deep learning approaches to support many software engineering tasks, from bug triaging to code generation. To assess the efficacy of such techniques, researchers typically perform controlled experiments. Conducting these experiments, however, is particularly challenging given the complexity of the space of variables involved: specialized and intricate architectures and algorithms, a large number of training hyper-parameters, and choices among evolving datasets, all compounded by how rapidly machine learning technology is advancing and by the inherent sources of randomness in the training process. In this work we conduct a mapping study, examining 194 experiments with techniques that rely on deep neural networks, appearing in 55 papers published in premier software engineering venues, to characterize the state of the practice and pinpoint common trends and pitfalls in these experiments. Our study reveals that most of the experiments, including those that have received ACM artifact badges, have fundamental limitations that raise doubts about the reliability of their findings. More specifically, we find: weak analyses to determine that there is a true relationship between independent and dependent variables (87% of the experiments); limited control over the space of DNN-relevant variables, so that an observed relationship between dependent variables and treatments may be correlational rather than causal (100% of the experiments); and a lack of specificity about which DNN variables and values are used to define the treatments being applied (86% of the experiments), which makes it unclear whether the techniques designed are the ones being assessed, or how the sources of extraneous variation are controlled. We provide some practical recommendations to address these limitations.
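As an illustration of the first pitfall (weak analyses under training randomness), the sketch below repeats a hypothetical experiment over several random seeds and compares two treatments with a permutation test on the difference of means. Nothing here comes from the paper: `run_experiment`, the treatment names, the accuracy values, and the choice of a permutation test are all assumptions made for illustration only.

```python
import random
import statistics


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of sample means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Randomly reassign the pooled scores to the two groups.
        rng.shuffle(pooled)
        diff = abs(
            statistics.mean(pooled[: len(a)]) - statistics.mean(pooled[len(a):])
        )
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


def run_experiment(treatment, seed):
    """Hypothetical stand-in for training and evaluating a DNN once.

    A real study would train the model with this seed and return its
    measured score; here the score is simulated (assumed values).
    """
    rng = random.Random(f"{treatment}-{seed}")
    base = 0.80 if treatment == "proposed" else 0.78
    return base + rng.gauss(0, 0.01)


# Repeat each treatment over several seeds instead of reporting one run.
seeds = range(10)
baseline = [run_experiment("baseline", s) for s in seeds]
proposed = [run_experiment("proposed", s) for s in seeds]

p_value = permutation_test(proposed, baseline)
print(f"mean baseline={statistics.mean(baseline):.4f}")
print(f"mean proposed={statistics.mean(proposed):.4f}")
print(f"permutation-test p-value={p_value:.4f}")
```

Reporting the per-seed scores and the seeds themselves alongside the test result also addresses the specificity concern, since it records which values defined each treatment.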

