Deduplication is a vital preprocessing step that enhances machine learning model performance and saves training time and energy. However, applying deduplication in federated learning is challenging: naive approaches scale poorly, and any scheme that requires clients to share their raw data violates privacy. In this paper, we address the problem of deduplication in a federated setup by introducing a novel protocol, Efficient Privacy-Preserving Multi-Party Deduplication (EP-MPD), which efficiently removes duplicates from multiple clients' datasets without compromising data privacy. EP-MPD is constructed in a modular fashion from two novel variants of the Private Set Intersection protocol. Our extensive experiments demonstrate the significant benefits of deduplication in federated learning of large language models: we observe up to a 19.61% improvement in perplexity and up to a 27.95% reduction in running time. EP-MPD effectively balances privacy and performance in federated learning, making it a valuable solution for large-scale applications.
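To make the underlying idea concrete, the sketch below shows cross-client deduplication as a plain set intersection over record hashes. This is only a functional illustration of what EP-MPD computes, not the protocol itself: the hash sets here are exchanged in the clear, whereas EP-MPD achieves the same result with privacy-preserving PSI variants so that no client learns the other's non-duplicate records. The function names and toy data are illustrative assumptions.

```python
import hashlib

def digest(record: str) -> str:
    # Hash each record so clients compare fixed-size digests rather than
    # raw text (a real PSI protocol would use oblivious primitives instead).
    return hashlib.sha256(record.encode("utf-8")).hexdigest()

def dedup_against(client_a: list[str], client_b: list[str]) -> list[str]:
    """Drop from client_b every record that also appears in client_a.

    NOTE: sharing a_digests directly leaks client_a's contents; EP-MPD's
    PSI-based construction avoids exactly this leakage.
    """
    a_digests = {digest(r) for r in client_a}
    return [r for r in client_b if digest(r) not in a_digests]

# Toy example: one duplicate-free record survives on client B.
client_a = ["the cat sat", "hello world"]
client_b = ["hello world", "federated learning"]
print(dedup_against(client_a, client_b))  # ['federated learning']
```

After pairwise deduplication, each client trains on its deduplicated shard as usual; the privacy guarantee of the full protocol comes from replacing the cleartext digest exchange with PSI.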