This paper studies representation learning for electronic health records. We represent patient histories as temporal sequences of diseases and learn their embeddings in an unsupervised setup with a transformer-based neural network model. The embedding space also includes demographic parameters, which allows the creation of generalized patient profiles and the successful transfer of medical knowledge to other domains. The medical profile model was trained on a dataset of more than one million patients. A detailed analysis of the model and a comparison with the state-of-the-art method show its clear advantage in the diagnosis prediction task. We further present two applications based on the developed profile model. First, a novel Harbinger Disease Discovery method reveals disease-associated hypotheses and may be beneficial in the design of epidemiological studies. Second, patient embeddings extracted from the profile model and applied to the insurance scoring task significantly improve the performance metrics.