Advances in machine learning (ML) have led to ML-based approaches for numerous software engineering tasks in software maintenance and evolution. Nevertheless, research indicates that, despite their potential, ML models are often not adopted in real-world settings because they remain a black box to practitioners, offering no explanation of their reasoning. Recently, rule-based, model-agnostic Explainable AI (XAI) techniques such as PyExplainer and LIME have been employed to explain the predictions of ML models in software analytics tasks. This paper assesses the ability of these techniques to generate reliable and consistent explanations for ML models across software analytics tasks, including Just-in-Time (JIT) defect prediction, clone detection, and the classification of useful code review comments. Our manual investigation reveals inconsistencies and anomalies in the explanations generated by both techniques. We therefore design a novel framework, Evaluation of Explainable AI (EvaluateXAI), together with granular-level evaluation metrics, to automatically assess how well rule-based XAI techniques generate reliable and consistent explanations for ML models in software analytics tasks. In in-depth experiments involving seven state-of-the-art ML models trained on five datasets and six evaluation metrics, none of the metrics reaches 100%, indicating that the explanations generated by the XAI techniques are unreliable. Moreover, PyExplainer and LIME fail to provide consistent explanations in 86.11% and 77.78% of the experimental combinations, respectively. Our findings therefore underscore the need for further XAI research to produce reliable and consistent explanations for ML models in software analytics tasks.
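To make concrete the kind of rule-based, model-agnostic explanation under evaluation, the sketch below uses LIME's public tabular API to explain a single prediction of a classifier and then re-runs the explanation as a naive consistency probe in the spirit of the paper's consistency metric. The synthetic data, feature names, class labels, and random-forest model are illustrative placeholders, not the paper's actual experimental setup or the EvaluateXAI metrics.

```python
# Minimal sketch (assumed setup, not the paper's): explaining one prediction
# of a tabular classifier with LIME's rule-like output.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(0)
X = rng.random((500, 4))                   # hypothetical "commit metrics"
y = (X[:, 0] + X[:, 2] > 1.0).astype(int)  # hypothetical "defective" label
feature_names = ["lines_added", "lines_deleted", "num_files", "entropy"]

model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X,
    feature_names=feature_names,
    class_names=["clean", "defective"],
    mode="classification",
)

# LIME returns rule-like feature conditions with weights,
# e.g. ("lines_added > 0.62", 0.31).
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
for rule, weight in exp.as_list():
    print(f"{rule:30s} weight={weight:+.3f}")

# Naive consistency probe: LIME's perturbation sampling is stochastic, so
# two runs on the same instance may yield different rule sets.
exp2 = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
rules1 = {rule for rule, _ in exp.as_list()}
rules2 = {rule for rule, _ in exp2.as_list()}
print("identical rule sets:", rules1 == rules2)
```

Because LIME fits a local surrogate on randomly sampled perturbations, repeated runs can produce differing rules for the same instance, which is the kind of inconsistency the paper's evaluation quantifies at scale.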