PROVGEN: A Privacy-Preserving Approach for Outcome Validation in Genomic Research

As genomic research has grown increasingly popular in recent years, dataset sharing has remained limited due to privacy concerns. This limitation hinders the reproducibility and validation of research outcomes, both of which are essential for identifying computational errors during the research process. In this paper, we introduce PROVGEN, a privacy-preserving method for sharing genomic datasets that facilitates reproducibility and outcome validation in genome-wide association studies (GWAS). Our approach encodes genomic data into binary space and applies a two-stage process. First, we generate a differentially private version of the dataset using an XOR-based mechanism that incorporates biological characteristics. Second, we restore data utility by adjusting the Minor Allele Frequency (MAF) values in the noisy dataset to align with published MAFs using optimal transport. Finally, we convert the processed binary data back into its genomic representation and publish the resulting dataset. We evaluate PROVGEN on three real-world genomic datasets and compare it with local differential privacy and three synthesis-based methods. We show that our proposed scheme outperforms all existing methods in detecting GWAS outcome errors, achieves better data utility, and provides higher privacy protection against membership inference attacks (MIAs). By adopting our method, genomic researchers will be inclined to share differentially private datasets while maintaining high data quality for reproducibility of their findings.
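The binary encoding and XOR perturbation described above can be illustrated with a minimal sketch. It is not PROVGEN's mechanism: the abstract does not specify the encoding or the noise distribution, so the two-bit encoding (0 -> 00, 1 -> 01, 2 -> 11), the independent per-bit flip probability p = 1 / (1 + e^epsilon) (plain randomized response, with no biological characteristics), and all function names are assumptions made for illustration. The optimal-transport MAF adjustment is not covered.

```python
import math
import random


def encode(genotypes):
    """Map each SNP value (0, 1, or 2 minor alleles) to two bits.

    Assumed encoding, not taken from the paper: 0 -> 00, 1 -> 01, 2 -> 11.
    """
    table = {0: (0, 0), 1: (0, 1), 2: (1, 1)}
    bits = []
    for g in genotypes:
        bits.extend(table[g])
    return bits


def decode(bits):
    """Sum each pair of bits: 00 -> 0, 01 or 10 -> 1, 11 -> 2."""
    return [bits[i] + bits[i + 1] for i in range(0, len(bits), 2)]


def xor_perturb(bits, epsilon, rng):
    """XOR every bit with independent Bernoulli(p) noise.

    Assumption: p = 1 / (1 + e^epsilon), i.e. per-bit randomized response.
    """
    p = 1.0 / (1.0 + math.exp(epsilon))
    return [b ^ (1 if rng.random() < p else 0) for b in bits]


def allele_frequency(genotypes):
    """Frequency of the counted allele: total allele count over 2n chromosomes."""
    return sum(genotypes) / (2 * len(genotypes))


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical values of one SNP across eight individuals.
    snp = [0, 1, 2, 1, 0, 0, 1, 2]
    noisy = decode(xor_perturb(encode(snp), epsilon=1.0, rng=rng))
    print("original frequency:", allele_frequency(snp))
    print("noisy frequency:   ", allele_frequency(noisy))
```

Smaller values of epsilon flip more bits and push the noisy allele frequency further from the original, which is the utility loss that the second stage of the method is meant to repair.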

