Current federated learning (FL) approaches view decentralized training data as a single table, divided among participants either horizontally (by rows) or vertically (by columns). However, these approaches are inadequate for handling distributed relational tables across databases. This scenario requires intricate SQL operations like joins and unions to obtain the training data, which is either costly or restricted by privacy concerns. This raises the question: can we directly run FL on distributed relational tables? In this paper, we formalize this problem as relational federated learning (RFL). We propose TablePuppet, a generic framework for RFL that decomposes the learning process into two steps: (1) learning over join (LoJ) followed by (2) learning over union (LoU). In a nutshell, LoJ pushes learning down onto the vertical tables being joined, and LoU further pushes learning down onto the horizontal partitions of each vertical table. TablePuppet incorporates computation/communication optimizations to deal with the duplicate tuples introduced by joins, as well as differential privacy (DP) to protect against both feature and label leakages. We demonstrate the efficiency of TablePuppet in combination with two widely-used ML training algorithms, stochastic gradient descent (SGD) and alternating direction method of multipliers (ADMM), and compare their computation/communication complexity. We evaluate the SGD/ADMM algorithms developed atop TablePuppet by training diverse ML models. Our experimental results show that TablePuppet achieves model accuracy comparable to the centralized baselines running directly atop the SQL results. Moreover, ADMM takes less communication time than SGD to converge to similar model accuracy.
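To make the idea of pushing learning down onto the joined tables concrete, the following is a minimal sketch, not TablePuppet's actual implementation. It assumes a linear model with squared loss over an inner join of two vertical tables, with labels stored in the first table. All names (`join_index`, `pushed_down_gradient`, `materialized_gradient`) and the toy data are hypothetical. Each table computes partial predictions on its own unduplicated tuples, and the residuals of duplicated tuples are aggregated back onto the original rows, so the join never has to be materialized.

```python
import numpy as np

def join_index(keys_a, keys_b):
    """Row-index pairs (ia, ib) of the inner join on equal keys (assumes a non-empty join)."""
    pairs = [(i, j) for i, ka in enumerate(keys_a)
                    for j, kb in enumerate(keys_b) if ka == kb]
    ia, ib = (np.array(c, dtype=int) for c in zip(*pairs))
    return ia, ib

def pushed_down_gradient(tables, idx, w, y):
    """Squared-loss gradient of a linear model over the join, computed table by table."""
    # Each table computes partial predictions on its own (unduplicated) tuples.
    partial = [X @ wk for X, wk in zip(tables, w)]
    # The partial predictions are expanded through the join mapping and summed.
    resid = sum(p[i] for p, i in zip(partial, idx)) - y
    grads = []
    for X, i in zip(tables, idx):
        # Residuals of tuples duplicated by the join are summed per original row.
        agg = np.bincount(i, weights=resid, minlength=X.shape[0])
        grads.append(X.T @ agg / len(y))
    return grads

def materialized_gradient(tables, idx, w, y):
    """Baseline: materialize the join first, then compute the same gradient."""
    J = np.hstack([X[i] for X, i in zip(tables, idx)])
    return J.T @ (J @ np.concatenate(w) - y) / len(y)

# Toy data (hypothetical): table A holds 3 features and the labels, table B holds 2 features.
rng = np.random.default_rng(0)
keys_a, keys_b = np.array([0, 0, 1, 2]), np.array([0, 1, 1, 3])
Xa, Xb = rng.normal(size=(4, 3)), rng.normal(size=(4, 2))
ya = rng.normal(size=4)

ia, ib = join_index(keys_a, keys_b)      # the join yields 4 rows, with duplicated tuples
y = ya[ia]                               # labels follow table A through the join
w = [rng.normal(size=3), rng.normal(size=2)]

g_pushed = np.concatenate(pushed_down_gradient([Xa, Xb], [ia, ib], w, y))
g_joined = materialized_gradient([Xa, Xb], [ia, ib], w, y)
```

Under these assumptions the two gradients agree, since the pushed-down version only regroups the same sums. The per-table term `X.T @ agg` is itself a sum over rows, so it decomposes further across horizontal partitions of a table, which is the intuition behind learning over union.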