In this paper, we consider the problem of answering count queries for genomic data subject to perfect privacy constraints. Count queries are often used in applications that collect aggregate (population-wide) information from biomedical databases (DBs) for analysis, such as genome-wide association studies. Our goal is to design mechanisms for answering count queries of the following form: \textit{How many users in the database have a specific set of genotypes at certain locations in their genome?} At the same time, we aim to achieve perfect privacy (zero information leakage) of the sensitive genotypes at a pre-specified set of secret locations. The sensitive genotypes could indicate rare diseases and/or other health traits that one may want to keep private. We present both local and central count-query mechanisms for this problem that achieve perfect information-theoretic privacy for the sensitive genotypes while minimizing the expected absolute error (or the per-user error probability, depending on the setting) of the query answer. We also derive a lower bound on the per-user probability of error for an arbitrary query-answering mechanism that satisfies perfect privacy. We show that our mechanisms achieve error close to the lower bound, and match it in some special cases. We show numerically that the performance of each mechanism depends on the prior distribution of the data, the intersection between the queried and sensitive genotypes, and the strength of the correlation in the genomic data sequence.
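To make the query form concrete, the following is a minimal sketch of the exact (non-private) count that such a mechanism would try to approximate. It is not one of the mechanisms proposed in the paper. The function name `count_query`, the 0/1/2 genotype encoding, and the toy database are assumptions introduced purely for illustration.

```python
from typing import Mapping, Sequence


def count_query(db: Sequence[Sequence[int]], pattern: Mapping[int, int]) -> int:
    """Return the number of users matching `pattern` at every queried location.

    `db` holds one genotype sequence per user; `pattern` maps a location
    index to the genotype required at that location (hypothetical encoding).
    """
    return sum(
        all(genome[loc] == value for loc, value in pattern.items())
        for genome in db
    )


# Toy database (assumed): 4 users, 3 locations, genotypes encoded as 0/1/2.
toy_db = [
    [0, 1, 2],
    [0, 1, 0],
    [1, 1, 2],
    [0, 1, 2],
]

# Query: how many users have genotype 0 at location 0 and genotype 2 at location 2?
true_count = count_query(toy_db, {0: 0, 2: 2})
```

A perfectly private mechanism would release a (possibly randomized) answer in place of `true_count`, chosen so that the released value carries no information about the genotypes at the secret locations; the error measures in the abstract quantify how far that answer is from the exact count.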