Machine-generated data is a valuable resource for training artificial intelligence algorithms, evaluating rare workflows, and sharing data under stricter data legislation. The challenge is to generate data that is both accurate and private. Current statistical and deep learning methods struggle with large data volumes, are prone to hallucinating scenarios incompatible with reality, and seldom quantify privacy meaningfully. Here we introduce Genomator, a logic-solving approach based on Boolean satisfiability (SAT) solving, which efficiently produces private and realistic representations of the original data. We demonstrate the method on genomic data, arguably the most complex and private kind of information. Synthetic genomes hold great potential for balancing underrepresented populations in medical research and advancing global data exchange. We benchmark Genomator against state-of-the-art methods (Markov generation, Restricted Boltzmann Machine, Generative Adversarial Network, and Conditional Restricted Boltzmann Machine), demonstrating an 84-93% accuracy improvement and 95-98% higher privacy. Genomator is also 1000-1600 times more efficient, making it the only tested method that scales to whole genomes. We show that the trade-off between privacy and accuracy is universal, and use Genomator's tuning capability to serve applications across the whole spectrum, from provably private representations of sensitive cohorts to datasets with indistinguishable pharmacogenomic profiles. Demonstrating production-scale generation of tuneable synthetic data can increase trust and pave the way for its use in the clinic.