Much research on machine learning testing relies on empirical studies to evaluate proposed techniques and demonstrate their potential. In this context, however, empirical results are sensitive to a number of parameters that can adversely affect the experiments and potentially lead to wrong conclusions (Type I errors, i.e., incorrectly rejecting the null hypothesis). To this end, we survey the related literature and identify 10 commonly adopted empirical evaluation hazards that may significantly impact experimental results. We then perform a sensitivity analysis of 30 influential studies published in top-tier SE venues against our hazard set and demonstrate the hazards' criticality. Our findings indicate that all 10 hazards have the potential to invalidate experimental findings, such as those reported in the related literature, and should be handled properly. Going a step further, we propose a set of 10 good empirical practices that have the potential to mitigate the impact of the hazards. We believe our work forms the first step towards raising awareness of common pitfalls and good practices within the software engineering community, and we hope it contributes towards setting expectations for empirical research in the field of deep learning testing.