DP-SNP-TIHMM: Differentially Private, Time-Inhomogeneous Hidden Markov Models for Synthesizing Genome-Wide Association Datasets

Single nucleotide polymorphism (SNP) datasets are fundamental to genetic studies but pose significant privacy risks when shared. Because SNPs are correlated with one another, strong adversarial attacks such as masked-value reconstruction, kinship inference, and membership inference become possible. Existing privacy-preserving approaches either apply differential privacy to statistical summaries of these datasets or rely on complex methods that require post-processing and a publicly available dataset to suppress or selectively share SNPs. In this study, we introduce a framework for generating synthetic SNP sequence datasets from samples drawn from time-inhomogeneous hidden Markov models (TIHMMs). To preserve the privacy of the training data, we ensure that each SNP sequence has only a bounded influence during training, which enables strong differential privacy guarantees. Crucially, by operating on full SNP sequences and bounding their gradient contributions, our method directly addresses the privacy risks introduced by the correlations among SNPs. In experiments on the real-world 1000 Genomes dataset, we demonstrate the efficacy of our method under privacy budgets of $\varepsilon \in [1, 10]$ at $\delta=10^{-4}$. Notably, allowing the transition models of the HMM to depend on the position in the sequence significantly improves performance, so that the synthetic datasets closely replicate the statistical properties of non-private datasets. This framework facilitates the private sharing of genomic data while offering researchers flexibility and utility.
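The two ideas in the abstract, position-dependent transitions and a bounded per-sequence gradient contribution, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `sample_tihmm` and `clip_and_noise`, the shared emission matrix, the three-valued genotype coding, and the DP-SGD-style clip-then-add-Gaussian-noise step are all assumptions made for illustration. Calibrating the noise multiplier to a target $(\varepsilon, \delta)$ is not shown.

```python
import numpy as np

def sample_tihmm(pi, A, B, rng):
    """Draw one sequence from a time-inhomogeneous HMM (illustrative sketch).

    pi: (K,)        initial hidden-state distribution
    A:  (T-1, K, K) one transition matrix per position; each row sums to 1
    B:  (K, V)      emission matrix, shared across positions (an assumption)
    """
    T = A.shape[0] + 1
    K, V = B.shape
    states = np.empty(T, dtype=int)
    obs = np.empty(T, dtype=int)
    states[0] = rng.choice(K, p=pi)
    obs[0] = rng.choice(V, p=B[states[0]])
    for t in range(1, T):
        # The transition matrix depends on the position t, which is what
        # makes the chain time-inhomogeneous.
        states[t] = rng.choice(K, p=A[t - 1, states[t - 1]])
        obs[t] = rng.choice(V, p=B[states[t]])
    return states, obs

def clip_and_noise(per_seq_grads, clip_norm, noise_multiplier, rng):
    """Bound each sequence's gradient, then add Gaussian noise (DP-SGD style).

    per_seq_grads: (N, D) array with one flattened gradient per training sequence.
    """
    norms = np.linalg.norm(per_seq_grads, axis=1, keepdims=True)
    # Scale down any row whose L2 norm exceeds clip_norm; leave the rest as is.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_seq_grads * scale
    # Noise standard deviation is proportional to the clipping bound.
    noise = rng.normal(0.0, noise_multiplier * clip_norm,
                       size=per_seq_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_seq_grads)

# Hypothetical usage: 4 hidden states, 10 SNP positions, genotypes coded 0/1/2.
rng = np.random.default_rng(0)
K, T, V = 4, 10, 3
pi = rng.dirichlet(np.ones(K))
A = rng.dirichlet(np.ones(K), size=(T - 1, K))  # shape (T-1, K, K)
B = rng.dirichlet(np.ones(V), size=K)           # shape (K, V)
states, obs = sample_tihmm(pi, A, B, rng)
```

Clipping each row of `per_seq_grads` caps how far any single SNP sequence can move the averaged update, which is the kind of bounded influence that the differential privacy guarantee depends on.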

