Access to individual-level health data is essential for gaining new insights and advancing science. In particular, modern methods based on artificial intelligence rely on the availability of and access to large datasets. In the health sector, however, access to individual-level data is often difficult because of privacy concerns. A promising alternative is the generation of fully synthetic data, i.e., data generated through a randomised process that have statistical properties similar to those of the original data but no one-to-one correspondence with the original individual-level records. In this study, we use a state-of-the-art synthetic data generation method and perform in-depth quality analyses of the generated data for a specific use case in the field of nutrition. We demonstrate the need for careful analyses of synthetic data that go beyond descriptive statistics and provide valuable insights into how to realise the full potential of synthetic datasets. By extending the methods and by thoroughly analysing the effects of sampling from a trained model, we are able to largely reproduce significant real-world analysis results in the chosen use case.