The machine learning modeling process conventionally culminates in selecting a single model that maximizes a chosen performance metric. However, this approach forgoes a deeper analysis of slightly inferior models. Particularly in medical and healthcare studies, where the objective extends beyond prediction to generating valuable insights, relying on a single model can lead to misleading or incomplete conclusions. This problem is particularly pertinent when dealing with the $\textit{Rashomon set}$, the set of models whose performance is close to the maximum. Such a set can be large and may contain models that describe the data in different ways, which calls for comprehensive analysis. This paper introduces a novel process for exploring models in the Rashomon set, extending the conventional modeling approach. We propose the $\texttt{Rashomon_DETECT}$ algorithm to detect models with different behavior. It is based on recent developments in the eXplainable Artificial Intelligence (XAI) field. To quantify differences in variable effects among models, we introduce the Profile Disparity Index (PDI), based on measures from functional data analysis. To illustrate the effectiveness of our approach, we showcase its application to predicting survival among hemophagocytic lymphohistiocytosis (HLH) patients, a foundational case study. Additionally, we benchmark our approach on other medical data sets, demonstrating its versatility and utility in various contexts. If differently behaving models are detected in the Rashomon set, their combined analysis leads to more trustworthy conclusions, which is of vital importance for high-stakes applications such as medicine.
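The two ideas named above, a Rashomon set of near-optimal models and a disparity measure between variable-effect profiles, can be illustrated with a minimal sketch. It is not the $\texttt{Rashomon_DETECT}$ algorithm and not the PDI as defined in the paper. The function names, the tolerance `epsilon`, and the sign-of-slope comparison are assumptions chosen for illustration, and the scores and profiles are made-up numbers.

```python
from typing import Dict, List, Sequence


def rashomon_set(scores: Dict[str, float], epsilon: float) -> List[str]:
    """Names of models whose score is within `epsilon` of the best score.

    Assumes a higher score is better (e.g. AUC); `epsilon` is a hypothetical
    tolerance, not a value taken from the paper.
    """
    best = max(scores.values())
    return sorted(name for name, s in scores.items() if s >= best - epsilon)


def slope_signs(profile: Sequence[float], tol: float = 1e-9) -> List[int]:
    """Sign (-1, 0, 1) of each consecutive difference of a profile on a grid."""
    signs = []
    for a, b in zip(profile, profile[1:]):
        d = b - a
        signs.append(0 if abs(d) <= tol else (1 if d > 0 else -1))
    return signs


def profile_disparity(p: Sequence[float], q: Sequence[float]) -> float:
    """Share of grid intervals on which two profiles move in different directions.

    A simplified stand-in for a profile disparity measure: 0 means the two
    profiles always move in the same direction, 1 means they never do.
    """
    if len(p) != len(q) or len(p) < 2:
        raise ValueError("profiles must share a grid with at least two points")
    sp, sq = slope_signs(p), slope_signs(q)
    return sum(a != b for a, b in zip(sp, sq)) / len(sp)


# Illustrative (made-up) validation scores for four hypothetical models.
scores = {"rf": 0.84, "gbm": 0.85, "glm": 0.79, "svm": 0.845}
near_optimal = rashomon_set(scores, epsilon=0.02)

# Illustrative (made-up) effect profiles of one variable for two models,
# evaluated on the same four-point grid.
profile_a = [0.10, 0.20, 0.30, 0.40]  # effect rises over the whole grid
profile_b = [0.10, 0.20, 0.15, 0.10]  # effect rises, then falls
disparity = profile_disparity(profile_a, profile_b)
```

Here `near_optimal` keeps the three models within 0.02 of the best score, and `disparity` reports that the two profiles disagree in direction on two of the three grid intervals. A pair of near-optimal models with a high disparity is the kind of case the abstract argues should be analyzed jointly instead of discarding the weaker model.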