In this paper, we investigate the naturalness of semantics-preserving transformations and their impact on the evaluation of neural program repair (NPR). To this end, we conduct a two-stage human study comprising (1) interviews with senior software developers to establish the first concrete criteria for assessing the naturalness of code transformations and (2) a survey in which 10 developers assess the naturalness of 1178 transformations, i.e., pairs of original and transformed programs, applied to 225 real-world bugs. Our findings reveal that nearly 60% of these transformations are considered natural and nearly 20% unnatural, with substantially high agreement among human annotators. Furthermore, the unnatural code transformations introduce a 25.2% false alarm rate in the robustness evaluation of five well-known NPR systems. In addition, the performance of the NPR systems drops notably when they are evaluated using natural transformations: the numbers of correct and plausible patches generated by these systems drop by up to 22.9% and 23.6%, respectively. These results highlight the importance of accounting for the naturalness of code transformations in robustness testing, which reveals the true effectiveness of NPR systems. Finally, we conduct an exploratory study on automating the assessment of the naturalness of code transformations by deriving a new naturalness metric based on cross-entropy. With this metric, the naturalness of code transformations can be assessed automatically with an AUC of 0.7.
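To make the idea of a cross-entropy-based naturalness score concrete, the following is a minimal sketch, not the metric derived in the paper, whose exact definition is not given in this abstract. It assumes a toy token-level bigram language model with add-one smoothing as a stand-in for whatever model the authors use, and it scores a transformation by how much the cross-entropy rises from the original program to the transformed one. The names `BigramModel`, `tokenize`, and `naturalness_delta` are hypothetical and introduced only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately crude tokenizer: identifiers, integers, or single symbols.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    return TOKEN_RE.findall(code)


class BigramModel:
    """Toy bigram language model with add-one smoothing (an assumption,
    standing in for the language model behind the paper's metric)."""

    def __init__(self, corpus):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        self.vocab = {"<s>", "<unk>"}
        for snippet in corpus:
            tokens = ["<s>"] + tokenize(snippet)
            self.vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def _known(self, token):
        # Map tokens never seen in training to a shared <unk> symbol.
        return token if token in self.vocab else "<unk>"

    def cross_entropy(self, code):
        """Average negative log2-probability per token (bits per token)."""
        tokens = ["<s>"] + [self._known(t) for t in tokenize(code)]
        if len(tokens) < 2:
            return 0.0
        v = len(self.vocab)
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            p = (self.bigram_counts[(prev, cur)] + 1) / (self.context_counts[prev] + v)
            total -= math.log2(p)
        return total / (len(tokens) - 1)


def naturalness_delta(model, original, transformed):
    """Cross-entropy increase caused by a transformation.
    Larger positive values mean the transformed program looks less
    natural to the model than the original did."""
    return model.cross_entropy(transformed) - model.cross_entropy(original)
```

Under these assumptions, a score like `naturalness_delta` could be thresholded to label a transformation as natural or unnatural, and sweeping that threshold against human labels is what an AUC figure summarizes. A real setup would replace the toy bigram model with a language model trained on a large code corpus.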