Gathering high-quality training data from multiple data sources while preserving privacy is a crucial challenge in training high-performance machine learning models. Solutions to this challenge could break down the barriers among isolated data corpora and consequently enlarge the range of data available for processing. To this end, both academic researchers and industrial vendors have recently been strongly motivated to propose two mainstream categories of solutions, both mainly based on software constructions: 1) Secure Multi-party Learning (MPL for short); and 2) Federated Learning (FL for short). Each of these two categories has its own advantages and limitations when evaluated against the following five criteria: security, efficiency, data distribution, the accuracy of trained models, and application scenarios. To present the research progress and discuss insights into future directions, we thoroughly investigate the protocols and frameworks of both MPL and FL. First, we define the problem of Training machine learning Models over Multiple data sources with Privacy Preservation (TMMPP for short). Then, we compare recent studies of TMMPP in terms of their technical routes, the number of parties supported, data partitioning, threat model, and machine learning models supported, to show their advantages and limitations. Next, we investigate and evaluate five popular FL platforms. Finally, we discuss potential directions for resolving the problem of TMMPP in the future.