Reproducibility-Oriented and Privacy-Preserving Genomic Dataset Sharing

Although genomic research has become increasingly widespread in recent years, few studies share their datasets because genomic records are privacy-sensitive. This hinders the reproduction and validation of research outcomes, which are crucial for catching errors (e.g., miscalculations) made during the research process. To the best of our knowledge, we are the first to propose a method for sharing genomic datasets in a privacy-preserving manner that supports the reproducibility of genome-wide association study (GWAS) outcomes. In this work, we introduce a differential privacy-based scheme for sharing genomic datasets to enhance the reproducibility of GWAS outcomes. The scheme involves two stages. In the first stage, we generate a noisy copy of the target dataset by applying the XOR mechanism to the binarized (encoded) dataset, where the binary noise generation takes biological features into account. However, this first stage introduces significant noise, making the dataset less suitable for direct GWAS validation. Thus, in the second stage, we apply a post-processing technique that uses optimal transport to adjust the Minor Allele Frequency (MAF) values in the noisy dataset so that they align more closely with those in a publicly available dataset, and then decodes the result back to genomic space. We evaluated the proposed scheme on three real-life genomic datasets and compared it with a baseline approach and two synthesis-based solutions with regard to detecting errors in GWAS outcomes, data utility, and resistance against membership inference attacks (MIAs). Our scheme outperforms all compared methods in detecting GWAS outcome errors, achieves better utility, and provides stronger privacy protection against MIAs. With our method, genomic researchers will be more inclined to share a differentially private yet high-quality version of their datasets.
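The two stages can be illustrated with a minimal sketch, which is not the authors' implementation. Several things here are assumptions made only for illustration: genotypes are minor-allele counts in {0, 1, 2}; the two-bit "thermometer" encoding in `encode`; independent per-bit flips with probability 1/(1 + e^ε) as a stand-in for the XOR mechanism (the abstract's noise generation considers biological features, which this sketch ignores, and the per-bit flip probability says nothing about the privacy budget of a whole record); and sorted matching as a stand-in for the optimal transport step. All function names are hypothetical.

```python
import math
import random


def encode(g):
    """Map a minor-allele count (0, 1, 2) to two bits (assumed encoding)."""
    return (1 if g >= 2 else 0, 1 if g >= 1 else 0)


def decode(bits):
    """Map two bits back to an allele count by summing them."""
    return bits[0] + bits[1]


def xor_perturb(dataset, epsilon, rng):
    """XOR each encoded bit with independent Bernoulli(p) noise.

    p = 1 / (1 + e^epsilon) is the randomized-response flip probability
    for a single bit; this is a simplification, not the paper's mechanism.
    """
    p = 1.0 / (1.0 + math.exp(epsilon))
    noisy = []
    for row in dataset:
        new_row = []
        for g in row:
            flipped = tuple(b ^ (1 if rng.random() < p else 0) for b in encode(g))
            new_row.append(decode(flipped))
        noisy.append(new_row)
    return noisy


def maf(dataset):
    """Minor allele frequency of each SNP column (rows are individuals)."""
    n = len(dataset)
    freqs = []
    for j in range(len(dataset[0])):
        f = sum(row[j] for row in dataset) / (2 * n)
        freqs.append(min(f, 1 - f))
    return freqs


def match_sorted(noisy_values, reference_values):
    """Pair the k-th smallest noisy value with the k-th smallest reference value.

    For two equal-size one-dimensional samples this monotone pairing is the
    optimal transport plan under a convex cost such as squared distance.
    """
    order = sorted(range(len(noisy_values)), key=lambda i: noisy_values[i])
    ref_sorted = sorted(reference_values)
    targets = [0.0] * len(noisy_values)
    for rank, i in enumerate(order):
        targets[i] = ref_sorted[rank]
    return targets


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: 200 individuals, 4 SNPs (synthetic, for illustration only).
    data = [[rng.choice([0, 0, 1, 2]) for _ in range(4)] for _ in range(200)]
    noisy = xor_perturb(data, epsilon=1.0, rng=rng)
    print("original MAF:", maf(data))
    print("noisy MAF:   ", maf(noisy))
```

Comparing the two printed MAF vectors shows how the first stage distorts allele frequencies, which is the motivation the abstract gives for the second stage. `match_sorted` only computes target values; how the noisy dataset is then adjusted toward those targets is not specified in the abstract and is left out here.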

