Monitoring patients' health status and predicting mortality in advance are vital for providing timely care and treatment. Large volumes of medical signs from electronic health records (EHR) are fed into advanced machine learning models to make these predictions. However, the data quality of the original clinical signs is rarely discussed in the literature. Based on an in-depth measurement of the missing rate and correlation score across various medical signs and a large number of hospital admission records, we found that the overall missing rate is extremely high and that a large number of uninformative signs can hurt the performance of prediction models. We therefore concluded that improving data quality alone can raise the baseline accuracy of different prediction algorithms. We designed MEDLENS, which combines an automatic, statistics-based selection of vital medical signs with a flexible interpolation approach for time series with high missing rates. After improving the data quality of the original medical signs, MEDLENS applies ensemble classifiers to boost accuracy and reduce computation overhead at the same time. It achieves 0.96 AUC-ROC and 0.81 AUC-PR, which exceeds the previous benchmark.
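The two data-quality steps described above (selecting signs by missing rate and correlation, then filling gaps in sparse time series) can be illustrated with a minimal sketch. It is not the MEDLENS implementation: the function names, the thresholds `max_missing` and `min_abs_corr`, the toy data, and the use of Pearson correlation and plain linear interpolation are all assumptions made for illustration, and the ensemble classifier stage is omitted.

```python
import math

def missing_rate(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def pearson(xs, ys):
    """Pearson correlation, computed only over rows where x is observed."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def select_signs(table, labels, max_missing=0.5, min_abs_corr=0.3):
    """Keep signs that are observed often enough and correlate with the label.

    Both thresholds are hypothetical values chosen for this example.
    """
    keep = []
    for name, column in table.items():
        if missing_rate(column) > max_missing:
            continue
        if abs(pearson(column, labels)) < min_abs_corr:
            continue
        keep.append(name)
    return keep

def interpolate(series):
    """Fill gaps linearly; leading/trailing gaps copy the nearest observation."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        return list(series)
    out = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = series[right]
        elif right is None:
            out[i] = series[left]
        else:
            w = (i - left) / (right - left)
            out[i] = series[left] + w * (series[right] - series[left])
    return out

# Toy data: one value per patient for each sign, plus a mortality label.
table = {
    "heart_rate": [80, None, 120, 130, 70, 125],   # informative, few gaps
    "lactate": [None, None, None, None, 2.0, None],  # mostly missing
    "height": [170, 160, 170, 160, 165, 165],      # uncorrelated with label
}
labels = [0, 0, 1, 1, 0, 1]

selected = select_signs(table, labels)
filled = interpolate([80, None, None, 110, None])
```

In this toy setup, `lactate` is dropped for its high missing rate and `height` for its near-zero correlation with the label, leaving only `heart_rate`. A real pipeline would tune both thresholds on held-out data and would likely need an interpolation scheme better suited to irregularly sampled clinical series.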