Applying Artificial Intelligence (AI) and Machine Learning (ML) in critical contexts such as medicine requires safety measures that reduce the risk of harm when predictions are wrong. Spotting ML failures is of paramount importance when ML predictions drive clinical decisions. ML predictive reliability measures the degree of trust in an ML prediction on a new instance, allowing decision-makers to accept or reject the prediction based on its reliability. To assess reliability, we propose a method that implements two principles. First, our approach evaluates whether an instance to be classified comes from the same distribution as the training set. To do this, we leverage the ability of Autoencoders (AEs) to reconstruct the training set with low error: an instance is considered Out-of-Distribution (OOD) if the AE reconstructs it with high error. Second, a proxy model is used to evaluate whether the ML classifier performs well on samples similar to the newly classified instance. We show that this approach can assess reliability both in a simulated scenario and on a model trained to predict disease progression in Multiple Sclerosis patients. We also developed a Python package, named relAI, to embed reliability measures into ML pipelines. The approach is simple and can be used in the deployment phase of any ML model to suggest whether its predictions should be trusted. Our method holds the promise of providing effective support to clinicians by spotting potential ML failures during deployment.
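
The two principles can be illustrated with a minimal NumPy sketch. It is not the relAI API and not the authors' implementation. Several choices here are assumptions made only for illustration: a PCA projection stands in for a trained autoencoder (it is the optimal linear encoder/decoder), the accuracy of the classifier on the nearest validation samples stands in for the proxy model, and the function names, the 95th-percentile error threshold, the 0.8 accuracy threshold, and the synthetic data are all hypothetical.

```python
import numpy as np

def fit_linear_autoencoder(X_train, n_components):
    """Fit a PCA projection, used here as a stand-in for a trained autoencoder."""
    mean = X_train.mean(axis=0)
    # Rows of vt are principal directions; keep the first n_components.
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_error(X, mean, components):
    """Mean squared error between each row and its encode/decode reconstruction."""
    centered = X - mean
    reconstructed = centered @ components.T @ components
    return ((centered - reconstructed) ** 2).mean(axis=1)

def local_accuracy(x, X_val, correct_val, k=10):
    """Fraction of the k nearest validation samples the classifier got right."""
    distances = np.linalg.norm(X_val - x, axis=1)
    nearest = np.argsort(distances)[:k]
    return correct_val[nearest].mean()

def is_reliable(x, mean, components, mse_threshold, X_val, correct_val,
                accuracy_threshold=0.8, k=10):
    """Reliable only if x is in-distribution AND local performance is adequate."""
    error = reconstruction_error(x[None, :], mean, components)[0]
    in_distribution = error <= mse_threshold
    locally_accurate = local_accuracy(x, X_val, correct_val, k) >= accuracy_threshold
    return bool(in_distribution and locally_accurate)

# --- Synthetic demonstration (all data below is made up for illustration) ---
rng = np.random.default_rng(0)
# Embed 2-D latent points in 5-D; rows are orthogonal with equal norm.
W = np.array([[1.0, 0.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 1.0, 0.0]])

def make_data(n):
    latent = rng.normal(size=(n, 2))
    return latent, latent @ W + 0.01 * rng.normal(size=(n, 5))

_, X_train = make_data(500)
latent_val, X_val = make_data(300)
# Hypothetical classifier behaviour: correct only where the first latent is positive.
correct_val = latent_val[:, 0] > 0

mean, components = fit_linear_autoencoder(X_train, n_components=2)
# Assumed rule: flag as OOD anything above the 95th percentile of training errors.
mse_threshold = np.percentile(reconstruction_error(X_train, mean, components), 95)

candidates = {
    "in-distribution, classifier accurate nearby": np.array([2.0, 0.0]) @ W,
    "in-distribution, classifier inaccurate nearby": np.array([-2.0, 0.0]) @ W,
    "out-of-distribution": np.array([0.0, 0.0, 0.0, 0.0, 5.0]),
}
results = {
    name: is_reliable(x, mean, components, mse_threshold, X_val, correct_val)
    for name, x in candidates.items()
}
for name, reliable in results.items():
    print(f"{name}: reliable={reliable}")
```

A prediction is accepted only when both checks pass, so an instance far from the training distribution is rejected regardless of how well the classifier performs on the validation samples nearest to it.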