Despite their success, machine learning (ML) models do not generalize well to data drawn from outside the training distribution. To deploy ML models reliably in real-world healthcare systems and avoid inaccurate predictions on out-of-distribution (OOD) data, it is crucial to detect OOD samples. Numerous OOD detection approaches have been proposed in other fields, especially computer vision, but it remains unclear whether the challenge is resolved for medical tabular data. To address this need, we propose an extensive, reproducible benchmark that compares different methods across a suite of tests covering both near-OOD and far-OOD samples. Our benchmark leverages the latest versions of eICU and MIMIC-IV, two public datasets encompassing tens of thousands of ICU patients across several hospitals. We consider a wide array of density-based methods and state-of-the-art (SOTA) post-hoc detectors across diverse predictive architectures, including MLP, ResNet, and Transformer. Our findings show that i) the problem appears to be solved for far-OOD data but remains open for near-OOD data; ii) post-hoc methods alone perform poorly but improve substantially when coupled with distance-based mechanisms; iii) the Transformer architecture is far less overconfident than MLP and ResNet.
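To make the near-OOD versus far-OOD distinction concrete, the following is a minimal sketch of one distance-based detector, a Mahalanobis score, scored with AUROC. It is an illustration under stated assumptions and not the benchmark's implementation: the data are synthetic Gaussians standing in for tabular features (not eICU or MIMIC-IV), the mean shifts of 0.5 and 5.0 are arbitrary stand-ins for "near" and "far", and the helper names `fit_gaussian`, `mahalanobis_score`, and `auroc` are hypothetical.

```python
import numpy as np


def fit_gaussian(train):
    """Estimate the mean and inverse covariance of in-distribution features."""
    mean = train.mean(axis=0)
    # A small ridge term keeps the covariance matrix invertible.
    cov = np.cov(train, rowvar=False) + 1e-6 * np.eye(train.shape[1])
    return mean, np.linalg.inv(cov)


def mahalanobis_score(x, mean, cov_inv):
    """Squared Mahalanobis distance per row; larger means more likely OOD."""
    diff = x - mean
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)


def auroc(id_scores, ood_scores):
    """Probability that a random OOD sample scores above a random ID sample."""
    greater = (ood_scores[:, None] > id_scores[None, :]).mean()
    ties = (ood_scores[:, None] == id_scores[None, :]).mean()
    return greater + 0.5 * ties


# Synthetic stand-in data (assumption): 8 standardized tabular features.
rng = np.random.default_rng(0)
n_features = 8
train = rng.normal(0.0, 1.0, size=(2000, n_features))
id_test = rng.normal(0.0, 1.0, size=(500, n_features))
near_ood = rng.normal(0.5, 1.0, size=(500, n_features))  # small shift
far_ood = rng.normal(5.0, 1.0, size=(500, n_features))   # large shift

mean, cov_inv = fit_gaussian(train)
id_scores = mahalanobis_score(id_test, mean, cov_inv)
near_auroc = auroc(id_scores, mahalanobis_score(near_ood, mean, cov_inv))
far_auroc = auroc(id_scores, mahalanobis_score(far_ood, mean, cov_inv))

print(f"near-OOD AUROC: {near_auroc:.3f}")
print(f"far-OOD AUROC:  {far_auroc:.3f}")
```

In this toy setting the far-OOD samples are separated almost perfectly, while the near-OOD samples overlap heavily with the in-distribution test set. This only mirrors the qualitative pattern described above and says nothing about the benchmark's actual numbers.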