Although federated learning (FL) promises to preserve privacy, recent work in the image and text domains has shown that training updates leak private client data. However, most high-stakes applications of FL (e.g., in healthcare and finance) use tabular data, where the risk of data leakage has not yet been explored. A successful attack on tabular data must address two key challenges unique to the domain: (i) solving a high-variance mixed discrete-continuous optimization problem, and (ii) enabling human assessment of the reconstruction, since, unlike for image and text data, direct human inspection is not possible. In this work, we address these challenges and propose TabLeak, the first comprehensive reconstruction attack on tabular data. TabLeak is based on two key contributions: (i) a method that leverages a softmax relaxation and pooled ensembling to solve the optimization problem, and (ii) an entropy-based uncertainty quantification scheme to enable human assessment. We evaluate TabLeak on four tabular datasets under both the FedSGD and FedAvg training protocols and show that it successfully breaks several settings previously deemed safe. For instance, we extract large subsets of private data at >90% accuracy even at the large batch size of 128. Our findings demonstrate that current high-stakes tabular FL is highly vulnerable to leakage attacks.
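The abstract names a softmax relaxation, pooled ensembling, and an entropy-based uncertainty score, but gives no implementation details. As a rough illustration only, and not the TabLeak implementation, the sketch below shows one way these ideas could fit together for a single categorical feature. The function names `softmax` and `pool_categorical` are hypothetical, and both the majority-vote pooling rule and the normalized Shannon entropy are assumptions made for this example.

```python
import math
from collections import Counter


def softmax(logits):
    """Map unconstrained scores to a probability vector over categories."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pool_categorical(votes, num_categories):
    """Pool one categorical feature across ensemble members (assumed rule).

    votes: the category index chosen by each independent reconstruction.
    Returns (pooled category, normalized entropy in [0, 1]); lower entropy
    means the members agree more, so the value may deserve more trust.
    """
    counts = Counter(votes)
    freqs = [counts.get(k, 0) / len(votes) for k in range(num_categories)]
    # Shannon entropy of the vote distribution; empty categories contribute 0.
    entropy = sum(p * math.log(1.0 / p) for p in freqs if p > 0)
    normalized = entropy / math.log(num_categories) if num_categories > 1 else 0.0
    # Majority vote; ties resolve to the lowest category index.
    pooled = max(range(num_categories), key=lambda k: freqs[k])
    return pooled, normalized


# Relaxation: optimize real-valued logits, then read off a discrete category.
probs = softmax([2.0, 0.5, -1.0])
hard_category = probs.index(max(probs))

# Pooling: four ensemble members vote on a three-category feature.
agree = pool_categorical([1, 1, 1, 1], num_categories=3)
split = pool_categorical([0, 1, 2, 0], num_categories=3)
```

In this sketch, `agree` has zero entropy because every member picked the same category, while `split` has entropy close to the maximum. An attacker could use such a score to rank reconstructed values by confidence without seeing the true data.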