The success of AI depends on the availability of data to train models. While a single data custodian sometimes holds enough data on its own, multiple custodians often need to collaborate to reach the cumulative size required for meaningful AI research. This is often the case for rare diseases, for example, where each clinical site has data for only a small number of patients. Recent algorithms for federated synthetic data generation are an important step towards collaborative, privacy-preserving data sharing. Existing techniques, however, focus exclusively on synthesizer training: they assume that the training data is already preprocessed and that the desired synthetic data can be delivered in one shot, without any hyperparameter tuning. In this paper, we propose an end-to-end collaborative framework for publishing synthetic data that accounts for privacy-preserving preprocessing as well as evaluation. We instantiate this framework with Secure Multiparty Computation (MPC) protocols and evaluate it in a use case for privacy-preserving publishing of synthetic genomic data for leukemia.