The proliferation of data in recent years has driven the development and adoption of a wide range of statistical and deep learning techniques, accelerating research and development. However, not all industries have benefited equally from the surge in data availability, partly because of legal restrictions on data use and privacy regulations, as in medicine. To address this issue, various statistical disclosure control and privacy-preserving methods have been proposed, including synthetic data generation. Synthetic data are generated from existing data, with the aim of replicating them as closely as possible and serving as a proxy for real sensitive data. This paper presents a systematic review of methods for generating and evaluating synthetic longitudinal patient data, a prevalent data type in medicine. The review follows the PRISMA guidelines and covers literature from five databases up to the end of 2022. It describes 17 methods, ranging from traditional simulation techniques to modern deep learning methods. The information collected includes, but is not limited to, method type, source code availability, and the approaches used to assess resemblance, utility, and privacy. The paper also discusses practical guidelines and key considerations for developing synthetic longitudinal data generation methods.