We introduce the Conditional Self-Attention Imputation (CSAI) model, a novel recurrent neural network architecture designed to handle the complex missing-data patterns found in multivariate time series derived from hospital electronic health records (EHRs). CSAI extends state-of-the-art neural network-based imputation with three modifications specific to EHR data: a) attention-based hidden state initialisation to capture the long- and short-range temporal dependencies prevalent in EHRs, b) domain-informed temporal decay to mimic clinical data recording patterns, and c) a non-uniform masking strategy that models non-random missingness by calibrating weights according to both temporal and cross-sectional data characteristics. A comprehensive evaluation on four EHR benchmark datasets demonstrates CSAI's effectiveness relative to state-of-the-art architectures in both data restoration and downstream tasks. CSAI is integrated into PyPOTS, an open-source Python toolbox for machine learning on partially observed time series. By aligning algorithmic imputation more closely with clinical realities, this work advances neural network imputation for EHRs.
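To make the temporal-decay component in b) more concrete, the following is a minimal sketch of the generic time-gap decay used in earlier recurrent imputation models such as GRU-D and BRITS, on which approaches of this kind build. It is not CSAI's domain-informed variant, whose details are not given in this abstract. The function names `time_gaps` and `decay`, and the scalar `weight` and `bias` parameters (learned per feature in a real model), are hypothetical and chosen for illustration only.

```python
import math


def time_gaps(mask, timestamps):
    """Time elapsed since each feature was last observed.

    mask[t][d] is 1 if feature d is observed at step t, else 0.
    timestamps[t] is the recording time of step t.
    """
    n_steps, n_feats = len(mask), len(mask[0])
    delta = [[0.0] * n_feats for _ in range(n_steps)]
    for t in range(1, n_steps):
        step = timestamps[t] - timestamps[t - 1]
        for d in range(n_feats):
            if mask[t - 1][d] == 1:
                # Feature was observed at the previous step: gap restarts.
                delta[t][d] = step
            else:
                # Feature was missing: the gap keeps accumulating.
                delta[t][d] = step + delta[t - 1][d]
    return delta


def decay(delta, weight=1.0, bias=0.0):
    """Generic decay factor exp(-max(0, w * delta + b)), in (0, 1].

    Assumption: weight and bias are fixed scalars here; in a trained
    model they would be learned parameters.
    """
    return [[math.exp(-max(0.0, weight * g + bias)) for g in row]
            for row in delta]


# Toy example: 4 time steps, 2 features, irregular sampling times.
mask = [[1, 1], [0, 1], [0, 0], [1, 1]]
timestamps = [0.0, 1.0, 2.0, 4.0]
gaps = time_gaps(mask, timestamps)
gamma = decay(gaps)
```

In this sketch, a longer gap since the last observation yields a smaller decay factor, so the model relies less on stale values. A domain-informed variant would adjust how the gap is scaled for each clinical feature.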