EvoXplain: When Machine Learning Models Agree on Predictions but Disagree on Why -- Measuring Mechanistic Multiplicity Across Training Runs

Machine learning models are primarily judged by predictive performance, especially in applied settings. Once a model reaches high accuracy, its explanation is often assumed to be correct and trustworthy. This assumption raises an overlooked question: when two models achieve high accuracy, do they rely on the same internal logic, or do they reach the same outcome via different and potentially competing mechanisms? We introduce EvoXplain, a diagnostic framework that measures the stability of model explanations across repeated training runs. Rather than analysing the explanation of a single trained model, EvoXplain treats explanations as samples drawn from the training and model selection pipeline itself, without aggregating predictions or constructing ensembles. It examines whether these samples form a single coherent explanatory basin or separate into multiple structured explanatory basins. We evaluate EvoXplain on the Adult Income and Breast Cancer datasets using deep neural networks and logistic regression. Although all models achieve high predictive accuracy, explanation stability differs across pipelines. Deep neural networks on Breast Cancer converge to a single explanatory basin, while the same architecture on Adult Income separates into distinct explanatory basins despite identical training conditions. Logistic regression on Breast Cancer exhibits conditional multiplicity, where basin accessibility is controlled by the regularisation configuration. EvoXplain does not attempt to select a correct explanation. Instead, it makes explanatory structure visible and quantifiable, revealing when single-instance explanations obscure the existence of multiple admissible predictive mechanisms. More broadly, EvoXplain reframes interpretability as a property of the training pipeline under repeated instantiation, rather than of any single trained model.
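
To make the idea of explanatory basins concrete, the following is a minimal illustrative sketch, not EvoXplain's actual procedure. The abstract does not specify the explanation method, the clustering algorithm, or the stability metric, so everything here is an assumption: the attribution vectors are synthetic stand-ins for per-run explanations, the split uses a simple spherical 2-means, and separation is scored with a silhouette value on cosine distance. All function names (`simulate_runs`, `two_means`, `silhouette`) are hypothetical.

```python
import math
import random


def unit(v):
    """Scale a vector to unit L2 norm so only the attribution pattern matters."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a, b):
    """1 - cosine similarity, assuming both inputs are unit vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def simulate_runs(prototypes, n_runs, noise, seed):
    """Synthetic stand-in for 'train n_runs models and explain each one'.

    Run i draws a noisy copy of prototypes[i % len(prototypes)].
    """
    rng = random.Random(seed)
    runs = []
    for i in range(n_runs):
        proto = prototypes[i % len(prototypes)]
        runs.append(unit([p + rng.gauss(0.0, noise) for p in proto]))
    return runs


def two_means(vectors, iters=20):
    """Split unit vectors into two groups with a simple spherical 2-means."""
    # Deterministic init: the first vector and the vector farthest from it.
    c0 = vectors[0]
    c1 = max(vectors, key=lambda v: cosine_distance(c0, v))
    centres = [c0, c1]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [
            min((0, 1), key=lambda k: cosine_distance(centres[k], v))
            for v in vectors
        ]
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centres[k] = unit(mean)
    return labels


def silhouette(vectors, labels):
    """Mean silhouette score; values near 1 suggest two well-separated groups."""
    scores = []
    for i, v in enumerate(vectors):
        same = [cosine_distance(v, w) for j, w in enumerate(vectors)
                if j != i and labels[j] == labels[i]]
        other = [cosine_distance(v, w) for j, w in enumerate(vectors)
                 if labels[j] != labels[i]]
        if not same or not other:
            scores.append(0.0)  # singleton or empty group: no evidence either way
            continue
        a = sum(same) / len(same)
        b = sum(other) / len(other)
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)


# Pipeline A (assumed): every run lands near one attribution pattern.
single = simulate_runs([[0.7, 0.2, 0.1]], n_runs=30, noise=0.02, seed=0)
# Pipeline B (assumed): runs alternate between two distinct patterns.
split = simulate_runs([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],
                      n_runs=30, noise=0.02, seed=0)

single_score = silhouette(single, two_means(single))
split_labels = two_means(split)
split_score = silhouette(split, split_labels)

print(f"single-pattern pipeline: separation score = {single_score:.3f}")
print(f"two-pattern pipeline:    separation score = {split_score:.3f}")
```

In this toy setting, the two-pattern pipeline yields a much higher separation score than the single-pattern one, even though nothing about predictive accuracy enters the calculation. In practice the synthetic vectors would be replaced by attributions computed from independently trained models, and the choice of distance, clustering method, and threshold would need to be justified for the pipeline at hand.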
