Reproducibility-Oriented and Privacy-Preserving Genomic Dataset Sharing

The accelerating pace of genomic research has created a high demand for genomic datasets in recent years, yet few studies release their datasets because of privacy concerns. This makes it difficult to validate and reproduce published results. In this work, to promote the reproducibility of genome-related research, we propose a novel scheme for sharing genomic datasets under differential privacy, which consists of two stages. In the first stage, the scheme generates a noisy copy of the genomic dataset by encoding the data entries as binary values and then XORing them with binary noise that is calibrated and sampled with optimized time complexity, while taking the biological properties of the datasets into account. In the second stage, the scheme uses optimal transport to alter the value distribution of each column in the generated copy so that it aligns with a privacy-preserving version (protected by the Laplace mechanism) of the corresponding distribution in the original dataset. We evaluate the scheme on two realistic genomic datasets from OpenSNP~\cite{opensnp} and compare it with two existing privacy-preserving techniques from NIST challenges~\cite{nist} with respect to GWAS reproducibility (e.g., the $\chi^2$ test and the odds ratio test) and other data utility metrics (e.g., point error and mean error). The results show that our scheme outperforms the two methods in GWAS reproducibility by $30\%$ with lower time complexity, and that it also achieves higher data utility for applications beyond reproducibility. We further validate experimentally that our scheme provides strong protection against both genomic and machine learning-based inference attacks. The experimental results show that, by constraining privacy leakage, our mechanism can encourage researchers to share a genomic dataset along with the research results obtained on it.
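The two stages described in the abstract can be illustrated with a minimal Python sketch for a single SNP column with values in {0, 1, 2}. It is not the authors' implementation, and several details are assumptions made only for illustration: the two-bit allele encoding decoded by counting set bits, the flip probability $1/(1+e^{\epsilon})$, a sensitivity of 1 for the histogram counts, and the function names `xor_perturb`, `laplace_histogram`, and `transport_align`. The sketch also ignores how the privacy budget is split between the two stages.

```python
import math
import random


def xor_perturb(column, epsilon, rng):
    """Stage 1 (sketch): XOR binary-encoded SNP values with binary noise.

    Assumption: a value in {0, 1, 2} is encoded as two allele bits and
    decoded by counting set bits, so every noisy entry is still valid.
    """
    p = 1.0 / (1.0 + math.exp(epsilon))  # assumed flip probability (< 0.5)
    noisy = []
    for value in column:
        bits = [1 if value >= 1 else 0, 1 if value == 2 else 0]
        flipped = [b ^ (1 if rng.random() < p else 0) for b in bits]
        noisy.append(sum(flipped))
    return noisy


def laplace_histogram(column, epsilon, rng, levels=3):
    """Laplace-noised value counts, rounded to integers that sum to n."""
    scale = 1.0 / epsilon  # assumes each count has sensitivity 1
    cumulative, running = [], 0.0
    for v in range(levels):
        # The difference of two exponentials is a Laplace(0, scale) sample.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        running += max(column.count(v) + noise, 0.0)
        cumulative.append(running)
    if running == 0.0:  # degenerate case: fall back to a uniform target
        cumulative = [float(v + 1) for v in range(levels)]
        running = float(levels)
    n = len(column)
    edges = [round(n * c / running) for c in cumulative]
    return [edges[0]] + [edges[i] - edges[i - 1] for i in range(1, levels)]


def transport_align(column, target_counts, rng):
    """Stage 2 (sketch): move entries so the column matches target_counts.

    For ordered values and cost |i - j|, the monotone matching of sorted
    sources to sorted targets is an optimal transport plan in one dimension.
    Ties between equal entries are broken at random.
    """
    order = sorted(range(len(column)), key=lambda i: (column[i], rng.random()))
    targets = [v for v, c in enumerate(target_counts) for _ in range(c)]
    aligned = list(column)
    for index, target in zip(order, targets):
        aligned[index] = target
    return aligned
```

A column would be processed by calling `xor_perturb` on the original values, `laplace_histogram` on the original values to obtain the protected target distribution, and then `transport_align` on the noisy copy with those target counts. The monotone matching is used here because it is simple and exact in one dimension; the abstract does not specify which optimal transport solver or cost function the scheme uses.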

