FLea: Addressing Data Scarcity and Label Skew in Federated Learning via Privacy-preserving Feature Augmentation

Federated Learning (FL) enables model development by leveraging data distributed across numerous edge devices without transferring local data to a central server. However, existing FL methods still struggle when data across devices is scarce and label-skewed, which causes local models to overfit and drift and consequently degrades the performance of the global model. In response to these challenges, we propose a framework called \textit{FLea}, incorporating the following key components: \textit{i)} A global feature buffer that stores activation-target pairs shared from multiple clients to support local training. This design mitigates local model drift caused by the absence of certain classes; \textit{ii)} A feature augmentation approach based on mix-ups of local and global activations for local training. This strategy enlarges the set of training samples, thereby reducing the risk of local overfitting; \textit{iii)} An obfuscation method that minimizes the correlation between intermediate activations and the source data, enhancing the privacy of shared features. To evaluate \textit{FLea}, we conduct extensive experiments on a wide range of data modalities, simulating different levels of local data scarcity and label skew. The results demonstrate that \textit{FLea} consistently outperforms state-of-the-art FL counterparts (in 13 of the 18 experimental settings, the improvement is over $5\%$) while concurrently mitigating the privacy vulnerabilities associated with shared features. Code is available at https://github.com/XTxiatong/FLea.git
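To make component \textit{ii)} more concrete, the following is a minimal sketch of how a local activation-target pair could be mixed with a pair drawn from a global feature buffer. It uses the standard mixup interpolation with a Beta-distributed weight, which is an assumption here: the abstract does not specify the exact formulation, and the obfuscation step from \textit{iii)} is not modeled. The names `one_hot`, `mix_up`, `augment_batch`, and the parameter `alpha` are hypothetical and are not taken from the FLea code base.

```python
import random


def one_hot(label, num_classes):
    """Return a one-hot target vector as a list of floats."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def mix_up(local_pair, global_pair, lam):
    """Interpolate two (activation, target) pairs with weight lam in [0, 1].

    Standard mixup interpolation (an assumption, not necessarily FLea's exact rule):
    mixed = lam * local + (1 - lam) * global, applied to features and targets alike.
    """
    (f_local, y_local), (f_global, y_global) = local_pair, global_pair
    feat = [lam * a + (1.0 - lam) * b for a, b in zip(f_local, f_global)]
    target = [lam * a + (1.0 - lam) * b for a, b in zip(y_local, y_global)]
    return feat, target


def augment_batch(local_batch, global_buffer, alpha=2.0, rng=None):
    """Mix every local pair with a randomly chosen pair from the global buffer.

    alpha is a hypothetical hyperparameter of the Beta(alpha, alpha) distribution
    from which the mixing weight is sampled.
    """
    rng = rng or random.Random()
    augmented = []
    for pair in local_batch:
        partner = rng.choice(global_buffer)   # pair shared by other clients
        lam = rng.betavariate(alpha, alpha)   # mixing weight in (0, 1)
        augmented.append(mix_up(pair, partner, lam))
    return augmented


if __name__ == "__main__":
    # A client that only holds class 0, and a buffer that contributes class 1.
    local_batch = [([1.0, 3.0], one_hot(0, 2))]
    global_buffer = [([3.0, 1.0], one_hot(1, 2))]
    for feat, target in augment_batch(local_batch, global_buffer, rng=random.Random(0)):
        print(feat, target)
```

Because the mixed targets are soft labels that include classes absent from the client, a sketch like this shows how buffer features could both enlarge the local training set and expose the local model to missing classes.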

