Explainable AI for Data-Driven Design of High-Dimensional Predictive Studies

Predictive modelling is important for health data analysis and data-driven clinical decision-making. However, predictive studies are difficult to design optimally by hand when tens or even hundreds of features require selection, transformation, or interaction modelling. Complex machine learning models offer high performance, but their "black-box" nature limits the clinical trust, transparency, and interpretability that decision-making requires. We developed and evaluated an Exploratory AI Recommender that provides data-driven recommendations for improving the predictive performance of existing interpretable statistical models. The framework uses flexible AI modelling to capture complex data patterns and explainable AI techniques to translate those patterns into three types of recommendation: feature exclusion, non-linear terms, and feature interactions. We evaluated the framework by comparing the predictive performance of a baseline Cox Proportional Hazards (CPH) model, with no interactions or non-linear terms, against an augmented CPH model incorporating the recommendations made by our method. The primary analysis predicted the time to the first occurrence of a fall or related injury in 245,614 patients. Our method recommended excluding 23 features, adding non-linear terms for two features, and adding 221 feature interactions. The C-index improved from 0.805 (95% CI 0.798-0.812) to 0.815 (95% CI 0.809-0.822), and calibration also improved (intercept: -0.006 to 0.003; slope: 1.063 to 0.950). All recommendations were supported by existing literature. The method was also effective on two additional public datasets, which suggests wider applicability. The proposed Exploratory AI Recommender demonstrates the potential of explainable AI and data-driven study design to improve both the development process and the performance of high-dimensional transparent predictive models.
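The comparison reported above rests on the C-index, which measures how often a model ranks the subject with the earlier event as higher risk. The following is a minimal sketch of Harrell's concordance index in plain Python, not the evaluation code used in the study. The function name `harrell_c_index`, the toy follow-up times, and both sets of risk scores are hypothetical and chosen only for illustration; pairs with tied times are skipped as a simplifying assumption.

```python
from itertools import combinations


def harrell_c_index(times, events, risks):
    """Harrell's concordance index for right-censored data (illustrative sketch).

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if censored
    risks  : model risk scores (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: pairs with tied times are skipped
        # a is the subject with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # earlier subject was censored: pair is not comparable
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data, NOT taken from the study.
times = [2.0, 4.0, 5.0, 7.0, 9.0, 12.0]
events = [1, 1, 0, 1, 0, 1]
baseline_risk = [0.9, 0.5, 0.6, 0.7, 0.2, 0.1]   # stand-in for a baseline model
augmented_risk = [0.9, 0.8, 0.6, 0.7, 0.2, 0.1]  # stand-in for an augmented model

c_baseline = harrell_c_index(times, events, baseline_risk)
c_augmented = harrell_c_index(times, events, augmented_risk)
print(f"baseline C-index:  {c_baseline:.3f}")
print(f"augmented C-index: {c_augmented:.3f}")
```

On real data, an established implementation from a survival analysis library would normally be preferable, together with bootstrap resampling to obtain confidence intervals like those reported above.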
