We present a model-agnostic framework for jointly optimizing the predictive performance and interpretability of supervised machine learning models for tabular data. Interpretability is quantified via three measures: feature sparsity, interaction sparsity of features, and sparsity of non-monotone feature effects. By treating hyperparameter optimization of a machine learning algorithm as a multi-objective optimization problem, our framework generates diverse models that trade off high performance against ease of interpretability in a single optimization run. Efficient optimization is achieved by augmenting the hyperparameter search space of the learning algorithm with feature selection, interaction constraints, and monotonicity constraints. We demonstrate that the optimization problem effectively translates to finding the Pareto-optimal set of groups of selected features that are allowed to interact in a model, along with their optimal monotonicity constraints and the optimal hyperparameters of the learning algorithm itself. We then introduce a novel evolutionary algorithm that operates efficiently on this augmented search space. In benchmark experiments, we show that our framework finds diverse models that are highly competitive with, or outperform, state-of-the-art XGBoost or Explainable Boosting Machine models with respect to both performance and interpretability.
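To make the multi-objective formulation more concrete, the following is a minimal sketch of how a candidate in the augmented search space might be encoded and how a Pareto-optimal set could be extracted from a pool of candidates. Everything here is an illustrative assumption rather than the framework's actual implementation: the `Candidate` class, the normalization of the three interpretability measures to fractions of the total feature count or total pair count, and the example error values are all hypothetical. The sketch also replaces the evolutionary search with a plain non-dominated filter over a fixed list.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Candidate:
    """One point in the augmented search space (hypothetical encoding)."""
    name: str
    groups: tuple    # groups of feature indices that are allowed to interact
    monotone: dict   # feature index -> +1 / -1 (constrained) or 0 (unconstrained)
    error: float     # validation error, assumed to be measured elsewhere


def interpretability_measures(c, n_features):
    """Return (feature, interaction, non-monotone) fractions; lower is sparser.

    The normalization is an assumption made for this sketch.
    """
    selected = {f for g in c.groups for f in g}
    # Two features may interact only if they share a group.
    pairs = {p for g in c.groups for p in combinations(sorted(g), 2)}
    # Selected features without a monotonicity constraint may have any shape.
    non_monotone = {f for f in selected if c.monotone.get(f, 0) == 0}
    max_pairs = n_features * (n_features - 1) // 2
    return (
        len(selected) / n_features,
        len(pairs) / max_pairs if max_pairs else 0.0,
        len(non_monotone) / n_features,
    )


def objectives(c, n_features):
    """All four objectives, each to be minimized."""
    return (c.error,) + interpretability_measures(c, n_features)


def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates, n_features):
    """Keep the candidates that no other candidate dominates."""
    scored = [(c, objectives(c, n_features)) for c in candidates]
    return [
        c for c, obj in scored
        if not any(dominates(other, obj) for _, other in scored)
    ]


# Hypothetical pool over 4 features; the error values are made up.
pool = [
    Candidate("A", ((0,), (1,)), {0: 1, 1: -1}, error=0.20),
    Candidate("B", ((0, 1, 2, 3),), {}, error=0.10),
    Candidate("C", ((0, 1), (2,)), {0: 1}, error=0.25),
]
front = pareto_front(pool, n_features=4)
print([c.name for c in front])
```

In this toy pool, candidate C is dominated by A (higher error and less sparse on every measure), while A and B represent different trade-offs and both remain. For a gradient-boosting learner, an encoding like `groups` and `monotone` could plausibly be passed on through XGBoost's `interaction_constraints` and `monotone_constraints` parameters, although whether the framework does so in exactly this way is not stated here.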