Background: Symbolic models, particularly decision trees, are widely used in software engineering (SE) for explainable analytics in defect prediction, configuration tuning, and software quality assessment. Most of these models rely on correlational split criteria, such as variance reduction or information gain, which identify statistical associations but cannot establish causation between X and Y. Recent empirical studies in software engineering show that both correlational models and causal discovery algorithms suffer from pronounced instability. This instability arises from two complementary issues: (1) correlation-based methods conflate association with causation; (2) causal discovery algorithms rely on heuristic approximations to cope with the NP-hard nature of structure learning, so their inferred graphs vary widely under minor input perturbations. Together, these issues undermine trust, reproducibility, and the reliability of explanations in real-world SE tasks. Objective: This study investigates whether incorporating causality-aware split criteria into symbolic models can improve their stability and robustness, and whether such gains come at the cost of predictive or optimization performance. We additionally examine how the stability of human expert judgments compares to that of automated models. Method: Using 120+ tasks from the MOOT repository of multi-objective optimization tasks, we evaluate stability through a preregistered bootstrap-ensemble protocol that measures variance in win-score assignments. We compare the stability of human causal assessments with that of correlation-based decision trees (EZR). We also evaluate causality-aware trees, which leverage conditional-entropy split criteria and confounder filtering. Stability and performance differences are analyzed using statistical methods (variance, Gini impurity, the Kolmogorov-Smirnov (KS) test, and Cliff's delta).
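Two of the quantities named in the Method, a conditional-entropy score for a candidate split and Cliff's delta as an effect size, can be illustrated with a minimal sketch. This is not the study's implementation: the function names `entropy`, `conditional_entropy`, and `cliffs_delta` are hypothetical, the features are assumed to be already discretized, and confounder filtering is not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(xs, ys):
    """H(Y | X): uncertainty left in ys after grouping rows by the value of xs.

    Assumption: xs is a discretized candidate split feature. A lower value
    means the feature removes more uncertainty about ys.
    """
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    n = len(ys)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all pairs, ranging over [-1, 1]."""
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))


if __name__ == "__main__":
    ys = [0, 0, 1, 1]
    # A feature that determines ys completely versus one that says nothing.
    print(conditional_entropy(["a", "a", "b", "b"], ys))
    print(conditional_entropy(["a", "b", "a", "b"], ys))
    # Effect size between two hypothetical samples of scores.
    print(cliffs_delta([3, 4, 5], [1, 2]))
```

In a stability study, a score such as `cliffs_delta` could be applied to the distributions of results obtained from repeated bootstrap samples, though how the preregistered protocol combines these statistics is not specified here.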