Functional data, i.e., smooth random functions observed over a continuous domain, are increasingly available in areas such as biomedical research, health informatics, and epidemiology. However, effective statistical analysis of functional data is often hindered by privacy constraints, sparse and irregular sampling, infinite dimensionality, and non-Gaussian structure. To address these challenges, we introduce Smooth Flow Matching (SFM), a novel framework for generative modeling of functional data that enables statistical analysis without exposing sensitive real data. Under a copula framework, SFM constructs a parsimonious smooth flow to generate infinite-dimensional functional data without Gaussianity or low-rank assumptions. It is computationally efficient, handles irregular observations, and guarantees the smoothness of the generated functions, offering a practical and flexible solution in scenarios where existing deep generative methods are not applicable. Through extensive simulation studies, we demonstrate the advantages of SFM in both synthetic data quality and computational efficiency. We then apply SFM to generate clinical trajectory data from MIMIC-IV, a longitudinal database of patient electronic health records (EHR). Our analysis shows that SFM can produce high-quality surrogate data for downstream tasks, highlighting its potential to boost the utility of EHR data for clinical applications.