Informative Missingness to Generate Irregular Clinical Time Series

Laboratory tests in electronic health records are collected irregularly, and the absence of a test order can be as informative as the measurement itself. Such missingness reflects both clinicians' decisions and patient physiology, so it should be modeled directly rather than treated as a preprocessing artifact. Here we present a diffusion-based approach for generating clinical time series that jointly models laboratory values and their observation patterns. We use the public Data Analytics Challenge on Missing Data Imputation (DACMI) benchmark derived from MIMIC-III. To preserve realistic sampling, we align chart times to 4-hour intervals and segment admissions into 7-day windows, producing trajectories that pair each lab value with a corresponding observation indicator. Standard transformations and normalization are applied to stabilize training. Our method extends the TimeDiff framework to learn continuous lab values and discrete missingness patterns through complementary diffusion objectives. Experiments show that the generated data closely match real patient trajectories in both individual lab distributions and joint value-missingness embeddings, demonstrating that diffusion models can capture clinically meaningful dependencies between patient physiology and clinicians' testing behavior under MNAR-like (missing-not-at-random) missingness. These preliminary results indicate that our model can serve as an initial component of future clinical foundation models. By producing synthetic priors that preserve key physiology-missingness relationships, this work motivates the subsequent training of Prior-Data Fitted Networks that can leverage informative missingness, which we will investigate in extended work.
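The preprocessing described above (4-hour alignment, 7-day windows, value/indicator pairs) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `to_windows`, the event layout (hours since admission, integer lab index, value), the zero placeholder for unobserved entries, and the "latest draw wins" rule for several measurements in one bin are all assumptions made for illustration. The transformations and normalization mentioned in the abstract are not specified there and are omitted here.

```python
import numpy as np

BIN_HOURS = 4                            # grid resolution stated in the abstract
WINDOW_DAYS = 7                          # window length stated in the abstract
STEPS = WINDOW_DAYS * 24 // BIN_HOURS    # 42 time steps per window


def to_windows(hours, lab_idx, values, n_labs):
    """Turn one admission's lab events into paired (value, indicator) windows.

    hours   : non-negative hours since admission, one entry per event
    lab_idx : integer lab id in [0, n_labs), one entry per event
    values  : measured value, one entry per event
    Returns (x, m), each of shape (n_windows, STEPS, n_labs), where
    m[w, t, k] == 1 iff lab k was observed in bin t of window w.
    """
    hours = np.asarray(hours, dtype=float)
    lab_idx = np.asarray(lab_idx, dtype=int)
    values = np.asarray(values, dtype=float)

    # Map each chart time onto the 4-hour grid.
    bins = np.floor(hours / BIN_HOURS).astype(int)
    n_windows = int(bins.max()) // STEPS + 1 if bins.size else 0

    # Assumption: unobserved entries hold a 0.0 placeholder; the indicator
    # array m, not the placeholder, carries the missingness information.
    x = np.zeros((n_windows * STEPS, n_labs))
    m = np.zeros((n_windows * STEPS, n_labs), dtype=np.int8)

    # Assumption: if one lab is drawn several times within a bin,
    # the latest draw overwrites earlier ones (events visited in time order).
    for i in np.argsort(hours, kind="stable"):
        x[bins[i], lab_idx[i]] = values[i]
        m[bins[i], lab_idx[i]] = 1

    # Cut the admission-long grid into consecutive 7-day windows.
    shape = (n_windows, STEPS, n_labs)
    return x.reshape(shape), m.reshape(shape)
```

For example, `to_windows([0.5, 3.9, 5.0, 170.0], [0, 0, 1, 0], [1.0, 2.0, 3.0, 4.0], n_labs=2)` yields two windows: the event at hour 170 falls in the first bin of the second window, and the two draws of lab 0 inside the first 4 hours collapse to the later value. Keeping the indicator array alongside the values is what lets a generative model treat the observation pattern as a modeling target in its own right.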
