Access to genomic data is highly regulated due to its sensitive nature. While safeguards are essential, cumbersome data access processes pose a significant barrier to the development of AI methods for genomics. Synthetic data generation can mitigate this tension by enabling broader data sharing without exposing sensitive information. Synthetic genomic data are produced by training generative models on real data and subsequently sampling artificial data that preserve relevant statistics while limiting disclosures about the underlying individuals. In some settings, a single data holder may have sufficient data to train such generative models; however, in many applications data must be combined across multiple sites to achieve adequate scale. This need arises, e.g., in rare disease studies, where individual hospitals typically hold data for only a small number of patients. The solution we present in this paper enables multiple data holders to jointly train a synthetic data generator without revealing their raw data. Our approach combines secure multiparty computation (MPC) to ensure input privacy, so that no party ever discloses its data in unencrypted form, with differential privacy (DP) to provide output privacy by mitigating information leakage from the released synthetic data. We empirically demonstrate the effectiveness of the proposed method by generating high-utility synthetic datasets from multiple real RNA-seq cohorts in federated settings, showing that our approach enables privacy-preserving data synthesis even when data are distributed across institutions.
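As a rough illustration of how MPC-style input privacy and DP-style output privacy can fit together, the following minimal sketch aggregates one number per site using additive secret sharing and then perturbs the aggregate with Gaussian noise. It is not the protocol used in the paper: the modulus `PRIME`, the fixed-point factor `SCALE`, and the helpers `share`, `secure_sum`, and `dp_release` are hypothetical names chosen for this example. The noise is added in the clear after reconstruction for simplicity, and `sigma` is not calibrated to any sensitivity bound or privacy budget.

```python
import random
import secrets

PRIME = 2**61 - 1  # assumed field modulus for the shares (illustrative)
SCALE = 10**4      # assumed fixed-point scaling factor (illustrative)


def encode(x: float) -> int:
    """Map a real value to a fixed-point field element."""
    return round(x * SCALE) % PRIME


def decode(v: int) -> float:
    """Map a field element back to a real value (upper half = negatives)."""
    if v > PRIME // 2:
        v -= PRIME
    return v / SCALE


def share(value: int, n_parties: int) -> list[int]:
    """Split a field element into n additive shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def secure_sum(local_values: list[float]) -> float:
    """Sum the sites' values; each party only ever sees random-looking shares."""
    n = len(local_values)
    # Site i sends its j-th share to party j.
    all_shares = [share(encode(v), n) for v in local_values]
    # Party j adds up the shares it received, without learning any input.
    partial = [sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)]
    # Only the combination of all partial sums reveals the total.
    return decode(sum(partial) % PRIME)


def dp_release(total: float, sigma: float, rng: random.Random) -> float:
    """Add Gaussian noise before release (not a cryptographically secure sampler)."""
    return total + rng.gauss(0.0, sigma)


if __name__ == "__main__":
    site_values = [1.5, -2.25, 3.0]  # made-up local statistics, one per site
    total = secure_sum(site_values)
    print(dp_release(total, sigma=0.5, rng=random.Random(0)))
```

In a full system the shared quantities would typically be model updates instead of single numbers, and the noise would be sampled inside the secure computation so that no party observes the exact aggregate.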