Machine learning (ML) models show strong promise for new biomedical prediction tasks, but concerns about trustworthiness have hindered their clinical adoption. In particular, it is often unclear whether a model relies on true clinical cues or on spurious hierarchical correlations in the data. This paper introduces a simple yet broadly applicable trustworthiness test grounded in stochastic proof-by-contradiction. Instead of merely reporting high test performance, our approach trains and tests models on labels that have been carefully permuted under a potential-outcomes framework. A truly trustworthy model should fail under such label permutation; comparable accuracy on real and permuted labels indicates overfitting, shortcut learning, or data leakage. Our approach quantifies this behavior through interpretable Fisher-style p-values, which are well understood by domain experts across the medical and life sciences. We evaluate our approach on multiple new bacterial diagnostics, distinguishing tasks and models that learn genuine causal relationships from those driven by dataset artifacts or statistical coincidences. Our work establishes a foundation for building rigor and trust between the ML and life-science research communities, moving ML models one step closer to clinical adoption.
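The core procedure can be sketched as follows. This is a minimal illustration only: the synthetic data, the nearest-centroid classifier, and the holdout split are hypothetical stand-ins, not the paper's actual models or bacterial diagnostic tasks. The model is evaluated once on the real labels and repeatedly on permuted labels; a Fisher-style p-value then measures how often a permuted-label run matches real-label performance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 10
X = rng.normal(size=(n, d))
# Genuine signal: the label depends on feature 0 plus noise.
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

def holdout_accuracy(X, y, rng):
    """Random 50/50 split, nearest-centroid classifier (stand-in for any model)."""
    idx = rng.permutation(len(y))
    tr, te = idx[: len(y) // 2], idx[len(y) // 2:]
    c0 = X[tr][y[tr] == 0].mean(axis=0)
    c1 = X[tr][y[tr] == 1].mean(axis=0)
    pred = (np.linalg.norm(X[te] - c1, axis=1) <
            np.linalg.norm(X[te] - c0, axis=1)).astype(int)
    return (pred == y[te]).mean()

# Accuracy when training and testing on the real labels.
obs = holdout_accuracy(X, y, rng)

# Null distribution: train and test on permuted labels B times.
B = 200
null = np.array([holdout_accuracy(X, rng.permutation(y), rng) for _ in range(B)])

# Fisher-style p-value: how often a permuted-label model does as well as
# the real-label model. A small p supports a genuine learned relationship;
# a large p flags overfitting, shortcut learning, or leakage.
p = (1 + (null >= obs).sum()) / (1 + B)
print(f"real-label acc={obs:.2f}  permuted mean={null.mean():.2f}  p={p:.4f}")
```

Here the model succeeds on real labels but collapses to chance under permutation, so the p-value is small; a model whose permuted-label accuracy stayed close to `obs` would instead yield a large p-value and fail the trustworthiness test.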