Research on differentially private synthetic tabular data has largely focused on independent and identically distributed rows, where each record corresponds to a unique individual. This perspective neglects the temporal complexity of longitudinal datasets, such as electronic health records, where each user contributes an entire (sub)table of sequential events. While practitioners might attempt to model such data by flattening user histories into high-dimensional vectors for use with standard marginal-based mechanisms, we demonstrate that this strategy is insufficient: flattening fails to preserve temporal coherence even when it maintains valid marginal distributions. We introduce PATH, a novel generative framework that treats the full table as the unit of synthesis and leverages the autoregressive capabilities of privately fine-tuned large language models. Extensive evaluations show that PATH effectively captures long-range dependencies that traditional methods miss. Empirically, our method reduces the distributional distance to real trajectories by over 60% and reduces state transition errors by nearly 50% compared to leading marginal mechanisms, while achieving similar marginal fidelity.