When an individual's DNA is sequenced, sensitive medical information becomes available to the sequencing laboratory. A recently proposed way to hide an individual's genetic information is to mix in DNA samples of other individuals. We assume that the genetic content of these samples is known to the individual but unknown to the sequencing laboratory. Thus, these DNA samples act as "noise" to the sequencing laboratory, but still allow the individual to recover their own genetic information afterward. Motivated by this idea, we study the problem of hiding a binary random variable $X$ (a genetic marker) with the additive noise provided by mixing DNA samples, using mutual information as a privacy metric. This is equivalent to finding, among a set of feasible discrete distributions, a worst-case noise distribution for recovering $X$ from the noisy observation. We characterize upper and lower bounds on the solution of this problem, which are empirically shown to be very close to each other. The lower bound is obtained through a convex relaxation of the original discrete optimization problem and yields a closed-form expression. The upper bound is computed via a greedy algorithm for selecting the mixing proportions.
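As an illustration of the privacy metric, the following is a minimal sketch that computes $I(X; X+Z)$ for a binary $X$ and a discrete noise variable $Z$. It assumes that $X$ and $Z$ are independent, so that $H(X+Z \mid X) = H(Z)$ and $I(X; X+Z) = H(X+Z) - H(Z)$. The function names and the example noise distributions are hypothetical and chosen only for illustration; they are not the feasible set induced by DNA mixing, nor the greedy algorithm or the convex relaxation mentioned above.

```python
import math
from collections import defaultdict


def entropy(pmf):
    """Shannon entropy in bits of a pmf given as {value: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)


def mutual_information(p_x1, noise_pmf):
    """I(X; X + Z) in bits, for X ~ Bernoulli(p_x1) independent of Z.

    noise_pmf maps each noise value z to P(Z = z). Independence of X and Z
    is an assumption of this sketch.
    """
    # Build the pmf of the noisy observation Y = X + Z.
    y_pmf = defaultdict(float)
    for z, pz in noise_pmf.items():
        y_pmf[z] += (1 - p_x1) * pz   # contribution of X = 0
        y_pmf[z + 1] += p_x1 * pz     # contribution of X = 1
    # With X independent of Z, H(Y | X) = H(Z), so I(X; Y) = H(Y) - H(Z).
    return entropy(y_pmf) - entropy(noise_pmf)


# Hypothetical noise distributions, for illustration only.
no_noise = {0: 1.0}
uniform_2 = {0: 0.5, 1: 0.5}
uniform_4 = {z: 0.25 for z in range(4)}

for name, pmf in [("none", no_noise), ("uniform on {0,1}", uniform_2),
                  ("uniform on {0,...,3}", uniform_4)]:
    print(f"{name}: {mutual_information(0.5, pmf):.3f} bits leaked")
```

In this toy setting, a lower value means the observation reveals less about $X$; spreading the noise over more values reduces the leakage for a uniform prior on $X$.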