Data silos, caused mainly by privacy concerns and interoperability barriers, significantly constrain collaboration among organizations that hold similar data for the same purpose. Distributed learning based on divide-and-conquer offers a promising way to break down data silos, but it faces several challenges, including autonomy, privacy guarantees, and the necessity of collaboration. This paper develops an adaptive distributed kernel ridge regression (AdaDKRR) that accounts for autonomy in parameter selection, privacy in communicating non-sensitive information, and the necessity of collaboration for performance improvement. We provide both theoretical verification and comprehensive experiments to demonstrate the feasibility and effectiveness of AdaDKRR. Theoretically, we prove that, under some mild conditions, AdaDKRR performs similarly to running the optimal learning algorithms on the whole data, which verifies the necessity of collaboration and shows that no other distributed learning scheme can essentially beat AdaDKRR under the same conditions. Numerically, we test AdaDKRR on both toy simulations and two real-world applications and show that it is superior to other existing distributed learning schemes. All these results show that AdaDKRR is a feasible scheme for overcoming data silos, a capability that is highly desired in numerous application areas such as intelligent decision-making, price forecasting, and performance prediction for products.
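For orientation, the divide-and-conquer idea behind distributed kernel ridge regression can be sketched as follows. This is a minimal illustration of the classical baseline (local kernel ridge regression on each data block, followed by a size-weighted average of the local predictions), not the adaptive AdaDKRR procedure itself. The Gaussian kernel, its `width`, the regularization value `lam`, the synthetic data, and all function names are assumptions made purely for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, width=0.5):
    # Pairwise squared Euclidean distances between rows of A and rows of B.
    d2 = (np.sum(A**2, axis=1)[:, None]
          + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * width**2))

def fit_local_krr(X, y, lam=1e-3, width=0.5):
    # Solve (K + lam * n * I) alpha = y on one local data block.
    n = X.shape[0]
    K = gaussian_kernel(X, X, width)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    return X, alpha

def predict_local(model, X_new, width=0.5):
    X_train, alpha = model
    return gaussian_kernel(X_new, X_train, width) @ alpha

def dkrr_predict(blocks, X_new, lam=1e-3, width=0.5):
    # Average the local estimators with weights |D_j| / |D|.
    total = sum(len(y) for _, y in blocks)
    pred = np.zeros(X_new.shape[0])
    for X, y in blocks:
        model = fit_local_krr(X, y, lam, width)
        pred += (len(y) / total) * predict_local(model, X_new, width)
    return pred

# Hypothetical toy data: a noisy sine curve split across four "organizations".
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(600, 1))
y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(600)
blocks = [(X[i::4], y[i::4]) for i in range(4)]

X_test = np.linspace(-0.9, 0.9, 50)[:, None]
y_hat = dkrr_predict(blocks, X_test)
rmse = float(np.sqrt(np.mean((y_hat - np.sin(np.pi * X_test[:, 0]))**2)))
print(f"RMSE of the averaged estimator: {rmse:.4f}")
```

In this sketch every block uses the same fixed `lam` and `width`, and only predictions at the query points leave each block. Choosing such parameters locally and adaptively is the kind of autonomy the abstract refers to, and it is not modeled here.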