Despite the wide use of explainability techniques to understand the behavior of Artificial Intelligence (AI) models, the explanations they generate are not always reliable. An explanation can appear plausible to humans yet fail to capture the internal reasoning of a model, particularly when dealing with complex tabular data. This paper studies the trustworthiness of local explainability techniques applied to complex tabular classification tasks, evaluating metrics for three main properties: faithfulness to the model's predictions, robustness to variations in the input data, and complexity of the explanation itself. A benchmark was performed for Local Interpretable Model-Agnostic Explanations (LIME), Kernel SHapley Additive exPlanations (SHAP), and Feature Ablation, across 32 datasets and different types of machine learning models. Model performance ranges were analyzed to identify two groups of samples: consensus-correct, samples that all models predicted correctly, and consensus-wrong, samples that all models predicted incorrectly. The results show that explanation quality is not always correlated with a model's predictive performance. Instead, dataset complexity and feature distributions seem to be the main factors affecting explanation quality and reliability.
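As an illustration of the two sample groups described above, the following is a minimal sketch of how consensus-correct and consensus-wrong samples could be separated. It is not the paper's implementation: the function name `consensus_groups`, the model names, and the toy labels are hypothetical, and the sketch assumes each model's predictions are available as a sequence aligned with the true labels.

```python
from typing import Dict, List, Sequence, Tuple


def consensus_groups(
    y_true: Sequence[int],
    predictions: Dict[str, Sequence[int]],
) -> Tuple[List[int], List[int]]:
    """Return (consensus_correct, consensus_wrong) sample indices.

    Hypothetical helper: `predictions` maps a model name to its
    predicted labels, aligned index-by-index with `y_true`.
    """
    consensus_correct: List[int] = []
    consensus_wrong: List[int] = []
    for i, label in enumerate(y_true):
        # One boolean per model: did it predict this sample correctly?
        hits = [preds[i] == label for preds in predictions.values()]
        if all(hits):
            consensus_correct.append(i)
        elif not any(hits):
            consensus_wrong.append(i)
        # Samples on which the models disagree fall into neither group.
    return consensus_correct, consensus_wrong


# Toy data (made up for illustration only).
y_true = [0, 1, 1, 0, 1]
predictions = {
    "model_a": [0, 1, 0, 1, 1],
    "model_b": [0, 0, 0, 1, 1],
    "model_c": [0, 1, 0, 1, 0],
}
correct, wrong = consensus_groups(y_true, predictions)
print(correct, wrong)
```

In this toy example, sample 0 is consensus-correct and samples 2 and 3 are consensus-wrong, while samples 1 and 4 are left out because the models disagree on them.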