The increasing pace of genomic research has created high demand for genomic datasets in recent years, yet few studies have released their datasets because of privacy concerns. This makes it difficult to validate and reproduce published results. In this work, to promote the reproducibility of genome-related research, we propose a novel scheme for sharing genomic datasets under differential privacy, which consists of two stages. In the first stage, the scheme generates a noisy copy of the genomic dataset by encoding the data entries as binary values and then XORing them with binary noise that is calibrated and sampled with optimized time complexity while accounting for the biological properties of the dataset. In the second stage, the scheme uses optimal transport to alter the value distribution of each column in the generated copy so that it aligns with a privacy-preserving version (protected by the Laplace mechanism) of the corresponding distribution in the original dataset. We evaluate the scheme on two realistic genomic datasets from OpenSNP~\cite{opensnp} and compare it with two existing privacy-preserving techniques from the NIST challenges~\cite{nist} with respect to GWAS reproducibility (e.g., the $\chi^2$ test and the odds ratio test) and other data utility metrics (e.g., point error and mean error). The results show that our scheme outperforms the two methods in GWAS reproducibility by $30\%$ with lower time complexity, and that it also achieves higher data utility for applications beyond reproducibility. We also validate experimentally that our scheme provides strong protection against both genomic and machine-learning-based inference attacks. The experimental results show that, by constraining privacy leakage, our mechanism is able to encourage the sharing of genomic datasets along with the research results based on them.
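To make the two stages concrete, the following is a minimal Python sketch under simplifying assumptions; it is not the calibrated mechanism evaluated in this work. It assumes SNP values in {0, 1, 2}, a hypothetical 2-bit encoding (0 as 00, 1 as 01, 2 as 11) decoded by counting set bits, plain per-bit randomized response with flip probability $1/(1+e^{\varepsilon})$ in place of the optimized noise calibration, an $L_1$ sensitivity of 2 for the per-column histogram, and a monotone rearrangement as the one-dimensional optimal transport step. The function names are illustrative, and neither the biological properties of the data nor the composition of the privacy budget across stages is modeled.

```python
import numpy as np

# Assumed 2-bit encoding of SNP values: 0 -> 00, 1 -> 01, 2 -> 11.
ENCODE = np.array([[0, 0], [0, 1], [1, 1]], dtype=np.uint8)


def xor_perturb(snps, eps, rng):
    """Stage 1 (simplified): XOR encoded bits with Bernoulli noise.

    Each bit is flipped independently with probability 1 / (1 + e^eps),
    i.e. plain randomized response on every bit.
    """
    bits = ENCODE[snps]                                  # shape (n, m, 2)
    flip_prob = 1.0 / (1.0 + np.exp(eps))
    noise = (rng.random(bits.shape) < flip_prob).astype(np.uint8)
    noisy_bits = bits ^ noise
    # Assumed decoding: the number of set bits, which is always in {0, 1, 2}.
    return noisy_bits.sum(axis=-1).astype(np.int64)


def laplace_histogram(column, eps, rng):
    """Laplace mechanism on the value counts of one column.

    Assumes L1 sensitivity 2 (replacing one record moves one unit of count
    from one bin to another). Returns a normalized distribution.
    """
    counts = np.bincount(np.asarray(column, dtype=np.int64), minlength=3)
    noisy = counts.astype(float) + rng.laplace(scale=2.0 / eps, size=3)
    noisy = np.clip(noisy, 0.0, None)
    if noisy.sum() == 0.0:
        return np.full(3, 1.0 / 3.0)
    return noisy / noisy.sum()


def transport_column(column, target_probs):
    """Stage 2 (simplified): move a column onto a target distribution.

    In one dimension, the monotone (sorted) rearrangement is an optimal
    transport plan for the cost |x - y|.
    """
    n = len(column)
    raw = np.asarray(target_probs, dtype=float) * n
    counts = np.floor(raw).astype(np.int64)
    remainder = n - counts.sum()
    # Hand leftover entries to the bins with the largest fractional parts.
    by_fraction = np.argsort(-(raw - counts), kind="stable")
    counts[by_fraction[:remainder]] += 1
    order = np.argsort(column, kind="stable")
    out = np.empty(n, dtype=np.int64)
    out[order] = np.repeat(np.arange(3), counts)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    snps = rng.integers(0, 3, size=(200, 5))             # toy data, not OpenSNP
    noisy = xor_perturb(snps, eps=1.0, rng=rng)
    shared = np.column_stack([
        transport_column(noisy[:, j], laplace_histogram(snps[:, j], 1.0, rng))
        for j in range(snps.shape[1])
    ])
    print(shared.shape)
```

In this sketch the second stage only changes as many entries as needed to match the noisy target histogram, which is the intuition behind using optimal transport: the shared copy stays close to the stage-one output while its column-wise statistics track the privately released distribution.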