We demonstrate a validity problem of machine learning in a vital application area, disease diagnosis in medicine. The problem arises when the target labels in the training data are determined by an indirect measurement and the fundamental measurements needed to compute that indirect measurement are included in the input data representation. Machine learning models trained on such data learn nothing but to reconstruct the known target definition exactly. Such models show perfect performance on similarly constructed test data but fail catastrophically on real-world examples in which the defining fundamental measurements are unavailable or only partially available. We present a general procedure for identifying problematic datasets and the black-box machine learning models trained on them, and we illustrate this detection procedure on the task of early sepsis prediction.
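The failure mode described above can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the target definition (`label_from_criteria`, positive when at least two of three binary criteria are met), the `LookupClassifier` that stands in for a flexible black-box model, and the use of `None` to mark unavailable measurements. It shows only the leakage effect, not the detection procedure itself.

```python
from collections import Counter, defaultdict
from itertools import product


def label_from_criteria(criteria):
    """Assumed toy target definition: positive if at least 2 of 3 criteria are met."""
    return int(sum(criteria) >= 2)


class LookupClassifier:
    """Majority-vote lookup table; a hypothetical stand-in for a black-box model."""

    def fit(self, rows, labels):
        votes = defaultdict(Counter)
        for row, y in zip(rows, labels):
            votes[row][y] += 1
        # Most frequent label per input row (ties resolve to the smaller label).
        self.table = {row: max(sorted(c), key=c.get) for row, c in votes.items()}
        overall = Counter(labels)
        # Fallback for inputs never seen in training: the most common label overall.
        self.default = max(sorted(overall), key=overall.get)
        return self

    def predict(self, rows):
        return [self.table.get(row, self.default) for row in rows]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# All combinations of three binary criteria, so the example is fully deterministic.
rows = list(product([0, 1], repeat=3))
labels = [label_from_criteria(r) for r in rows]

model = LookupClassifier().fit(rows, labels)

# Similarly constructed test data: the defining measurements are fully visible.
# Because the inputs are enumerated exhaustively, such a test set has the same rows.
test_accuracy = accuracy(labels, model.predict(rows))

# Deployment-like data: the defining measurements are unavailable (None).
masked_rows = [(None, None, None) for _ in rows]
deployment_accuracy = accuracy(labels, model.predict(masked_rows))

print(test_accuracy, deployment_accuracy)  # prints "1.0 0.5"
```

In this sketch the model reaches perfect accuracy only because its inputs contain the full target definition. Once those inputs are masked it falls back to a constant prediction and drops to chance level on the balanced toy labels.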