Defect reduction planning plays a vital role in enhancing software quality and minimizing software maintenance costs. By training a black-box machine learning model and "explaining" its predictions, explainable AI for software engineering aims to identify the code characteristics that impact maintenance risks. However, post-hoc explanations do not always faithfully reflect what the original model computes. In this paper, we introduce CounterACT, a Counterfactual ACTion rule mining approach that can generate defect reduction plans without black-box models. By leveraging action rules, CounterACT provides a course of action that can be considered a counterfactual explanation for the class (e.g., buggy or not buggy) assigned to a piece of code. We compare the effectiveness of CounterACT with the original action rule mining algorithm and six established defect reduction approaches on 9 software projects. Our evaluation is based on (a) overlap scores between proposed code changes and actual developer modifications; (b) improvement scores in future releases; and (c) the precision, recall, and F1-score of the plans. Our results show that, compared to competing approaches, CounterACT's explainable plans achieve higher overlap scores at the release level (median 95%) and commit level (median 85.97%), and they offer a better trade-off between precision and recall (median F1-score 88.12%). Finally, we venture beyond planning and explore leveraging Large Language Models (LLMs) to generate code edits from our generated plans. Our results show that LLM code edits supported by our plans are actionable and are more likely to pass relevant test cases than vanilla LLM code recommendations.