The generation of synthetic medical records using generative adversarial networks (GANs) has become increasingly important for addressing privacy concerns and promoting data sharing in the medical field. In this paper, we propose a novel method for generating synthetic hybrid medical records consisting of chest X-ray images (CXRs) and structured tabular data (including anthropometric data and laboratory tests) using an auto-encoding GAN (αGAN) and a conditional tabular GAN (CTGAN). We first trained an αGAN model on a large public database (pDB) to reduce the dimensionality of CXRs. We then applied the trained encoder of the αGAN model to the images in the original database (oDB) to obtain their latent vectors. These latent vectors were combined with the tabular data in the oDB, and the joint data were used to train the CTGAN model. We successfully generated diverse synthetic records of hybrid CXR and tabular data while maintaining the correspondence between the two. We evaluated the resulting synthetic database (sDB) through visual assessment, the distribution of inter-record distances, and classification tasks. The evaluation showed that the sDB captured the features of the oDB while maintaining the correspondence between the images and the tabular data. Although our approach relies on the availability of a large-scale pDB containing a substantial number of images with the same modality and imaging region as those in the oDB, it could enable the public release of synthetic datasets without compromising the secondary use of data.
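The data flow described above (encode images, join latent vectors with tabular data, fit a tabular generator, sample, decode) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy data, the array sizes, the linear `encode`/`decode` pair standing in for the αGAN encoder and generator, and the multivariate Gaussian standing in for CTGAN. It is not the models or code used in this work, and the `nearest_distances` helper is only one plausible reading of an inter-record distance check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the original database (oDB): 200 records, each with a
# flattened 16x16 "image" and 3 tabular columns. All sizes are illustrative.
n_records, n_pixels, n_tabular, latent_dim = 200, 256, 3, 8
tabular = rng.normal(size=(n_records, n_tabular))
# Images depend on the tabular values, so there is a correspondence to preserve.
mixing = rng.normal(size=(n_tabular, n_pixels))
images = tabular @ mixing + 0.1 * rng.normal(size=(n_records, n_pixels))

# Hypothetical stand-in for the encoder/generator pretrained on the public
# database (pDB): a fixed linear projection with orthonormal columns.
# This is NOT an alpha-GAN; it only plays the same role in the data flow.
projection, _ = np.linalg.qr(rng.normal(size=(n_pixels, latent_dim)))


def encode(x):
    """Map flattened images to low-dimensional latent vectors."""
    return x @ projection


def decode(z):
    """Map latent vectors back to flattened images."""
    return z @ projection.T


# Step 1: encode the oDB images into latent vectors.
latents = encode(images)

# Step 2: join each latent vector with the tabular data of the same record.
joint = np.hstack([latents, tabular])

# Step 3: fit a generative model on the joint rows. A multivariate Gaussian
# is used here as a simple stand-in for CTGAN.
mean = joint.mean(axis=0)
cov = np.cov(joint, rowvar=False)

# Step 4: sample synthetic joint rows, split them, and decode the latent part.
synthetic_joint = rng.multivariate_normal(mean, cov, size=n_records)
synthetic_latents = synthetic_joint[:, :latent_dim]
synthetic_tabular = synthetic_joint[:, latent_dim:]
synthetic_images = decode(synthetic_latents)


def nearest_distances(queries, references):
    """Euclidean distance from each query row to its closest reference row."""
    diffs = queries[:, None, :] - references[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


# Distance from each synthetic record to its nearest original record.
synthetic_to_original = nearest_distances(synthetic_joint, joint)
print(synthetic_images.shape, synthetic_tabular.shape)
print(float(np.median(synthetic_to_original)))
```

Because the image latent vector and the tabular values are sampled together as one row, each synthetic image stays paired with its own tabular record. The distribution of `synthetic_to_original` can then be compared with the distances among original records to see whether synthetic records sit unusually close to real ones.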