RISED: A Pre-Deployment Evaluation Framework for High-Stakes AI Decision-Support Systems, with Application to Healthcare

From arXiv: 39 pages, 7 figures, 15 tables. Code is available at https://github.com/rohithreddybc/rised-healthcare-eval and the dataset at https://doi.org/10.57967/hf/8734 (Hugging Face). To be submitted to Expert Systems with Applications (Elsevier).

Clinical decision-support systems are expert systems whose recommendations clinicians act on directly, yet they are usually cleared on one aggregate accuracy number from a held-out test set. That number says nothing about input reliability under encoding shifts, subgroup gaps, threshold sensitivity, or operational feasibility. We present RISED, a pre-deployment evaluation framework operationalising five dimensions (Reliability, Inclusivity, Sensitivity, Equity, Deployability) through BCa bootstrap 95% confidence intervals, literature-grounded thresholds, and Holm-Bonferroni-corrected PASS / FAIL / INCONCLUSIVE verdicts; Equity is a proxy-dependence diagnostic rather than a gating test. Applied to seven cohorts spanning 35 years (n from 303 to 99,492), RISED surfaces failures invisible to AUROC: on Diabetes 130, Reliability passes by three orders of magnitude (PSS = 0.0004) while Inclusivity (AUC parity gap = 0.262) and Sensitivity (max threshold-flip rate 49.1%) fail decisively; both NHIS cohorts reproduce this. NHANES 2021-2023, with a complete feature profile, achieves INCONCLUSIVE verdicts; BRFSS 2024 produces the suite's most severe Sensitivity failure (max threshold-flip rate 64.2%) after instrument rotation removed hypertension and cholesterol. The pattern recurs on credit- and income-prediction cohorts, confirming domain-agnosticity; a multi-model check shows the failures are data-driven, not model-specific. RISED ships as an open-source Python package complementing TRIPOD+AI, FUTURE-AI, and Fairlearn with the structured numerical evidence those standards require but do not prescribe.
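To make the verdict logic concrete, the following is a minimal sketch of how Holm-Bonferroni correction could feed a three-way PASS / FAIL / INCONCLUSIVE decision. It is not the RISED package's API: the function names `holm_bonferroni` and `verdict`, the decision rule (a dimension is decided only if its test survives correction, and otherwise stays INCONCLUSIVE), the thresholds, and the p-values are all assumptions made for illustration. Only the three point estimates are taken from the Diabetes 130 figures quoted in the abstract.

```python
from typing import Dict, List


def holm_bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Holm's step-down procedure; returns reject flags in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The (rank+1)-th smallest p-value is compared with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: every larger p-value is also not rejected
    return reject


def verdict(estimate: float, threshold: float, significant: bool) -> str:
    """Hypothetical rule: lower is better, undecided tests stay INCONCLUSIVE."""
    if not significant:
        return "INCONCLUSIVE"
    return "PASS" if estimate <= threshold else "FAIL"


# Point estimates quoted in the abstract for Diabetes 130.
estimates: Dict[str, float] = {
    "Reliability": 0.0004,   # PSS
    "Inclusivity": 0.262,    # AUC parity gap
    "Sensitivity": 0.491,    # max threshold-flip rate
}
# ASSUMED values for illustration only; not taken from the paper.
thresholds: Dict[str, float] = {
    "Reliability": 0.10,
    "Inclusivity": 0.05,
    "Sensitivity": 0.10,
}
p_values: Dict[str, float] = {
    "Reliability": 0.001,
    "Inclusivity": 0.002,
    "Sensitivity": 0.004,
}

names = list(estimates)
flags = holm_bonferroni([p_values[n] for n in names], alpha=0.05)
verdicts = {
    n: verdict(estimates[n], thresholds[n], flag)
    for n, flag in zip(names, flags)
}

for name, result in verdicts.items():
    print(f"{name}: {result}")
```

With these assumed inputs the sketch yields a PASS for Reliability and a FAIL for Inclusivity and Sensitivity, matching the pattern the abstract reports. A dimension whose p-value does not survive the correction would be labelled INCONCLUSIVE rather than forced into PASS or FAIL.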
