In recent years, Machine Learning (ML) has seen widespread adoption across a broad range of sectors, including high-stakes domains such as healthcare, finance, and law. This growing reliance has heightened concerns regarding model interpretability and accountability, particularly as legal and regulatory frameworks place tighter constraints on the use of black-box models in critical applications. Although interpretable ML has attracted substantial attention, systematic evaluations of inherently interpretable models, especially for tabular data, remain relatively scarce and often focus primarily on aggregate performance outcomes. To address this gap, we present a large-scale comparative evaluation of 16 inherently interpretable methods, ranging from classical linear models and decision trees to more recent approaches such as Explainable Boosting Machines (EBMs), Symbolic Regression (SR), and Generalized Optimal Sparse Decision Trees (GOSDT). Our study spans 216 real-world tabular datasets and goes beyond aggregate rankings by stratifying performance according to structural dataset characteristics, including dimensionality, sample size, linearity, and class imbalance. In addition, we assess training time and robustness under controlled distributional shifts. Our results reveal clear performance hierarchies, especially for regression tasks, where EBMs consistently achieve strong predictive accuracy. At the same time, we show that performance is highly context-dependent: SR and Interpretable Generalized Additive Neural Networks (IGANNs) perform particularly well in non-linear regimes, while GOSDT models exhibit pronounced sensitivity to class imbalance. Overall, these findings provide practical guidance for practitioners seeking a balance between interpretability and predictive performance, and contribute to a deeper empirical understanding of interpretable modeling for tabular data.