Automated Code Revision (ACR) tools aim to reduce manual effort by automatically generating code revisions based on reviewer feedback. While ACR tools have shown promising performance on historical data, their real-world utility depends on their ability to handle similar code variants that express the same issue, a property we define as consistency. However, the probabilistic nature of ACR tools often compromises consistency and may lead to divergent revisions even for semantically equivalent code variants. In this paper, we investigate the extent to which ACR tools maintain consistency when presented with semantically equivalent code variants. To do so, we first designed nine types of semantics-preserving perturbations (SPPs) and applied them to 2032 Java methods from real-world GitHub projects, generating over 10K perturbed variants for evaluation. We then used these perturbed variants to evaluate the consistency of five state-of-the-art transformer-based ACR tools. We found that the ACR tools' ability to generate correct revisions can drop by up to 45.3% when presented with semantically equivalent code. The closer a perturbation is to the code region targeted by the reviewer's feedback, the more likely an ACR tool is to fail to generate the correct revision. We explored potential mitigation strategies that modify the input representation, but found that these attention-guiding heuristics yielded only marginal improvements, leaving this problem as an open research question.
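To make the notion of a semantics-preserving perturbation concrete, the sketch below shows one hypothetical perturbation of the kind the study applies: rewriting a `for` loop as an equivalent `while` loop. The class name, method names, and the specific transformation are illustrative assumptions, not taken from the paper's nine SPP types or its dataset; the point is only that the two variants differ in surface form yet agree on every input, so a consistent ACR tool should revise both the same way.

```java
// Hypothetical SPP illustration (names and transformation are assumptions,
// not the paper's actual perturbation catalog).
public class SppDemo {

    // Original method: sums an array with a for-loop.
    static int sumOriginal(int[] xs) {
        int total = 0;
        for (int i = 0; i < xs.length; i++) {
            total += xs[i];
        }
        return total;
    }

    // Perturbed variant: same semantics, different surface form
    // (for-loop rewritten as an equivalent while-loop).
    static int sumPerturbed(int[] xs) {
        int total = 0;
        int i = 0;
        while (i < xs.length) {
            total += xs[i];
            i++;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = {3, 1, 4, 1, 5};
        // The two variants are semantically equivalent: identical
        // outputs on the same input.
        System.out.println(sumOriginal(data) == sumPerturbed(data));
    }
}
```

Because the perturbation changes only syntax, any divergence in an ACR tool's revisions between the two variants is a consistency failure rather than a semantic one.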