DNA motif discovery is an important problem in gene research: it aims to identify transcription factor binding sites (i.e., motifs) in DNA sequences in order to reveal the mechanisms that regulate gene expression. However, data silos and the risk of privacy leakage have seriously hindered its development. On the one hand, data silos make data collection difficult. On the other hand, because DNA is sensitive private information, collecting and using DNA data is complicated. How to discover DNA motifs while preserving privacy and alleviating data silos has therefore become an important question. This paper proposes a novel method, DP-FLMD, to address this problem. Note that this is the first application of federated learning to the field of genetics research. Federated learning is used to address data silos: it allows multiple participants to train a model jointly while providing privacy protection. To reduce the communication cost of federated learning, DP-FLMD applies a sampling method and a communication-reduction strategy. In addition, differential privacy, a privacy protection technique with rigorous mathematical guarantees, is applied to DP-FLMD. Experiments on DNA datasets show that DP-FLMD achieves high mining accuracy and runtime efficiency, and that its performance is affected by some parameters.
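To make the combination of federated aggregation and differential privacy concrete, the following is a minimal sketch and not the DP-FLMD algorithm itself, whose details are not given in the abstract. It assumes each participant counts how many of its local sequences contain each candidate motif, perturbs those counts with the Laplace mechanism, and sends only the noisy counts to a server that sums them. The function names (`local_support`, `privatize`, `aggregate`), the fixed candidate list, and the choice of the Laplace mechanism are all illustrative assumptions.

```python
import random
from typing import Dict, List


def local_support(sequences: List[str], candidates: List[str]) -> Dict[str, int]:
    """For each candidate motif, count the local sequences that contain it."""
    return {m: sum(1 for s in sequences if m in s) for m in candidates}


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(counts: Dict[str, int], epsilon: float,
              rng: random.Random) -> Dict[str, float]:
    """Add Laplace noise to a participant's counts (hypothetical helper).

    Adding or removing one sequence changes each candidate's count by at
    most 1, so the L1 sensitivity of the count vector is at most
    len(counts). Noise with scale sensitivity / epsilon then gives
    epsilon-differential privacy for this single release.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = len(counts) / epsilon
    return {m: c + laplace_noise(scale, rng) for m, c in counts.items()}


def aggregate(reports: List[Dict[str, float]]) -> Dict[str, float]:
    """Server side: sum the noisy counts reported by all participants."""
    total: Dict[str, float] = {}
    for report in reports:
        for motif, value in report.items():
            total[motif] = total.get(motif, 0.0) + value
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = ["ACG", "GGG"]  # toy candidate motifs
    silos = [
        ["ACGTACGT", "TTTACG", "GGGG"],  # participant 1 (toy data)
        ["CCACGA", "GGGTTT"],            # participant 2 (toy data)
    ]
    reports = [privatize(local_support(s, candidates), epsilon=1.0, rng=rng)
               for s in silos]
    print(aggregate(reports))
```

In this sketch the raw sequences never leave a participant; only perturbed counts are shared. A smaller `epsilon` gives stronger privacy but noisier aggregated counts, which is one plausible way that parameters could influence mining accuracy.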