Although genomic research has grown rapidly in recent years, dataset sharing remains limited because of privacy concerns. This limitation hinders the reproducibility and validation of research outcomes, both of which are essential for identifying computational errors made during the research process. In this paper, we introduce PROVGEN, a privacy-preserving method for sharing genomic datasets that facilitates reproducibility and outcome validation in genome-wide association studies (GWAS). Our approach encodes genomic data into binary space and then applies a two-stage process. First, we generate a differentially private version of the dataset using an XOR-based mechanism that incorporates biological characteristics. Second, we restore data utility by using optimal transport to adjust the minor allele frequency (MAF) values in the noisy dataset so that they align with published MAFs. Finally, we convert the processed binary data back into its genomic representation and publish the resulting dataset. We evaluate PROVGEN on three real-world genomic datasets and compare it with local differential privacy and three synthesis-based methods. Our results show that PROVGEN outperforms all of these baselines in detecting GWAS outcome errors, achieves better data utility, and provides stronger privacy protection against membership inference attacks (MIAs). We expect that adopting our method will encourage genomic researchers to share differentially private datasets while retaining the data quality needed to reproduce their findings.
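To make the encode, perturb, and decode steps concrete, the following is a minimal sketch and not PROVGEN's actual mechanism. It assumes genotypes are stored as minor-allele counts (0, 1, or 2), uses a hypothetical two-bit encoding (0 to 00, 1 to 01, 2 to 11), and XORs each bit with independent Bernoulli noise at the standard randomized-response flip probability 1/(1 + e^ε). The biological characteristics used by the XOR mechanism and the optimal-transport MAF adjustment are omitted. All function names are illustrative.

```python
import numpy as np

def encode(genotypes):
    """Map 0/1/2 allele counts to two bits each (assumed: 0->00, 1->01, 2->11)."""
    g = np.asarray(genotypes, dtype=np.uint8)
    hi = (g == 2).astype(np.uint8)   # set only for homozygous minor
    lo = (g >= 1).astype(np.uint8)   # set when at least one minor allele
    return np.stack([hi, lo], axis=-1)

def decode(bits):
    """Sum the two bits, so the otherwise unused pattern 10 decodes to 1."""
    return bits.sum(axis=-1).astype(np.uint8)

def xor_perturb(bits, epsilon, rng):
    """XOR every bit with independent Bernoulli(p) noise, p = 1 / (1 + e^epsilon).

    This is plain per-bit randomized response, used here as a stand-in
    for the paper's XOR-based mechanism.
    """
    p = 1.0 / (1.0 + np.exp(epsilon))
    mask = (rng.random(bits.shape) < p).astype(np.uint8)
    return bits ^ mask

def allele_freq(genotypes):
    """Per-SNP frequency of the counted allele: mean count divided by 2."""
    return np.asarray(genotypes).mean(axis=0) / 2.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: 1000 individuals x 5 SNPs (rows are individuals).
    data = rng.integers(0, 3, size=(1000, 5))
    noisy = decode(xor_perturb(encode(data), epsilon=1.0, rng=rng))
    print("original frequencies:", allele_freq(data))
    print("noisy frequencies:   ", allele_freq(noisy))
```

In this sketch the privacy parameter applies to each bit separately, so the overall guarantee for a full genome depends on how the budget composes across bits. The gap between the original and noisy frequencies is the distortion that a utility-restoration stage, such as the MAF alignment described above, would aim to reduce.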