In computational biology, predictive models are widely used to address complex tasks, but their performance can suffer greatly when applied to data from different distributions. The current state-of-the-art domain adaptation method for high-dimensional data aims to mitigate these issues by aligning the input dependencies between training and test data. However, this approach requires centralized access to both source and target domain data, raising concerns about data privacy, especially when the data comes from multiple sources. In this paper, we introduce a privacy-preserving federated framework for unsupervised domain adaptation in high-dimensional settings. Our method employs federated training of Gaussian processes and weighted elastic nets to effectively address the problem of distribution shift between domains, while utilizing secure aggregation and randomized encoding to protect the local data of participating data owners. We evaluate our framework on the task of age prediction using DNA methylation data from multiple tissues, demonstrating that our approach performs comparably to existing centralized methods while maintaining data privacy, even in distributed environments where data is spread across multiple institutions. Our framework is the first privacy-preserving solution for high-dimensional domain adaptation in federated environments, offering a promising tool for fields like computational biology and medicine, where protecting sensitive data is essential.