Software engineering techniques increasingly rely on deep learning to support a wide range of tasks, from bug triaging to code generation. To assess the efficacy of such techniques, researchers typically perform controlled experiments. Conducting these experiments, however, is particularly challenging given the complexity of the space of variables involved: specialized and intricate architectures and algorithms, a large number of training hyper-parameters, and choices among evolving datasets, all compounded by the rapid advance of machine learning technology and the inherent sources of randomness in the training process. In this work we conduct a mapping study that examines 194 experiments with techniques that rely on deep neural networks (DNNs), appearing in 55 papers published in premier software engineering venues, to characterize the state of the practice and pinpoint common trends and pitfalls in these experiments. Our study reveals that most of the experiments, including those that have received ACM artifact badges, have fundamental limitations that raise doubts about the reliability of their findings. More specifically, we find: weak analyses for determining whether there is a true relationship between independent and dependent variables (87% of the experiments); limited control over the space of DNN-relevant variables, so that an observed relationship between dependent variables and treatments may be correlational rather than causal (100% of the experiments); and a lack of specificity about which DNN variables and values are used in the experiments to define the treatments being applied (86% of the experiments), which makes it unclear whether the techniques being assessed are the ones that were designed, and how sources of extraneous variation are controlled. We provide practical recommendations to address these limitations.