The proliferation of data in recent years has driven the development and use of a wide range of statistical and deep learning techniques, accelerating research and development. However, not all fields have benefited equally from the surge in data availability: in some, such as medicine, legal restrictions and privacy regulations limit how data may be used. To address this issue, various statistical disclosure control and privacy-preserving methods have been proposed, including synthetic data generation. Synthetic data are generated from existing data with the aim of replicating them as closely as possible, so that they can act as a proxy for real sensitive data. This paper presents a systematic review of methods for generating and evaluating synthetic longitudinal patient data, a prevalent data type in medicine. The review adheres to the PRISMA guidelines and covers literature from five databases up to the end of 2022. The paper describes 17 methods, ranging from traditional simulation techniques to modern deep learning methods. The information collected includes, but is not limited to, method type, source code availability, and the approaches used to assess resemblance, utility, and privacy. Furthermore, the paper discusses practical guidelines and key considerations for developing synthetic longitudinal data generation methods.