The success of AI depends on the availability of data to train models. While in some cases a single data custodian may hold enough data on its own, often multiple custodians need to collaborate to reach the cumulative size required for meaningful AI research. This is often the case for rare diseases, for example, where each clinical site has data for only a small number of patients. Recent algorithms for federated synthetic data generation are an important step towards collaborative, privacy-preserving data sharing. Existing techniques, however, focus exclusively on synthesizer training, assuming that the training data is already preprocessed and that the desired synthetic data can be delivered in one shot, without any hyperparameter tuning. In this paper, we propose an end-to-end collaborative framework for publishing synthetic data that accounts for privacy-preserving preprocessing as well as evaluation. We instantiate this framework with Secure Multiparty Computation (MPC) protocols and evaluate it in a use case for privacy-preserving publishing of synthetic genomic data for leukemia.