Recent work in machine learning for healthcare has raised concerns about patient privacy and algorithmic fairness. For example, previous work has shown that patient self-reported race can be predicted from medical data that does not explicitly contain racial information. However, the extent to which such attributes can be identified from the data is unknown, and we lack ways to develop models whose outcomes are minimally affected by this information. Here we systematically investigated the ability of time-series electronic health record data to predict patient static information. We found that not only the raw time-series data but also the representations learned by machine learning models can be used to predict a variety of static information, with area under the receiver operating characteristic curve as high as 0.851 for biological sex, 0.869 for binarized age and 0.810 for self-reported race. Such high predictive performance extends to a wide range of comorbidity factors and persists even when the model was trained for a different task, and across different cohorts, model architectures and databases. Given the privacy and fairness concerns these findings pose, we developed a variational autoencoder-based approach that learns a structured latent space to disentangle patient-sensitive attributes from time-series data. Our work thoroughly investigates the ability of machine learning models to encode patient static information from time-series electronic health records and introduces a general approach to protect patient-sensitive attribute information for downstream tasks.
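The probing idea in the abstract (predict a static attribute from time-series data and score the prediction by area under the receiver operating characteristic curve, AUROC) can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic, the helper names `auroc` and `simulate_patient` are hypothetical, and the "probe" is simply the per-patient mean of a single simulated signal whose baseline is assumed to shift with a binary attribute.

```python
import random


def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) identity; ties get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the group of tied scores starting at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def simulate_patient(attribute, rng, n_steps=24):
    """Synthetic signal (assumption): the baseline shifts slightly with the attribute."""
    baseline = 70.0 + 3.0 * attribute
    return [rng.gauss(baseline, 5.0) for _ in range(n_steps)]


rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(500)]       # binary static attribute
series = [simulate_patient(y, rng) for y in labels]    # one time series per patient
scores = [sum(s) / len(s) for s in series]             # simplest probe: per-patient mean

print(f"AUROC of mean-value probe: {auroc(scores, labels):.3f}")
```

Even though the attribute never appears as an input feature, a small systematic shift in the signal is enough for a trivial probe to recover it. In a real study the probe would be a trained classifier evaluated on held-out patients rather than a fixed summary statistic.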