Heterogeneity across devices in federated learning (FL) typically refers to statistical (e.g., non-i.i.d. data distributions) and resource (e.g., communication bandwidth) dimensions. In this paper, we focus on another important dimension that has received less attention: varying quantities and distributions of labeled and unlabeled data across devices. To leverage all data, we develop a decentralized federated domain adaptation methodology that transfers ML models from devices with high-quality labeled data (called sources) to devices with low-quality or unlabeled data (called targets). Our methodology, Source-Target Determination and Link Formation (ST-LF), optimizes both (i) the classification of devices into sources and targets and (ii) source-target link formation, in a manner that considers the trade-off between ML model accuracy and communication energy efficiency. To obtain a concrete objective function, we derive a measurable generalization error bound that accounts for estimates of source-target hypothesis deviations and divergences between data distributions. The resulting optimization problem is a mixed-integer signomial program, a class of NP-hard problems, which we solve tractably with an algorithm based on successive convex approximations. Numerical evaluations of ST-LF demonstrate that it improves classification accuracy and energy efficiency over state-of-the-art baselines.
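To make the joint decision concrete, the following is a minimal toy sketch of the two coupled choices (which devices act as sources, and which source each target links to) under an accuracy-versus-energy trade-off. It is not the ST-LF algorithm: it enumerates all role assignments by brute force rather than using successive convex approximations, and the cost terms are simple stand-ins rather than the generalization error bound derived in the paper. The function name `toy_st_lf`, the inputs `label_quality`, `divergence`, and `energy`, and the weight `phi` are all hypothetical names introduced for illustration.

```python
from itertools import product


def toy_st_lf(label_quality, divergence, energy, phi=0.5):
    """Brute-force toy for joint source/target selection and link formation.

    All inputs are hypothetical proxies, not quantities defined in the paper:
      label_quality[i] in [0, 1]: how good device i's own labeled data is
      divergence[i][j]: proxy for the data-distribution gap between i and j
      energy[i][j]: proxy for the energy cost of sending a model from i to j
      phi: weight placed on communication energy
    Returns (total_cost, roles, links), where roles[i] is 1 for a source and
    0 for a target, and links maps each target to its chosen source.
    """
    n = len(label_quality)
    best = None
    # Enumerate every assignment of roles (exponential in n; toy sizes only).
    for roles in product([0, 1], repeat=n):
        sources = [i for i in range(n) if roles[i] == 1]
        if not sources:
            continue  # at least one source is needed to supply a model

        # A source trains on its own labels; its error proxy is 1 - quality.
        cost = sum(1.0 - label_quality[i] for i in sources)
        links = {}
        for j in range(n):
            if roles[j] == 1:
                continue

            # A target inherits the source's error, plus a divergence
            # penalty, plus the weighted energy of the link.
            def link_cost(i, j=j):
                return (1.0 - label_quality[i]) + divergence[i][j] + phi * energy[i][j]

            s = min(sources, key=link_cost)
            links[j] = s
            cost += link_cost(s)

        if best is None or cost < best[0]:
            best = (cost, roles, links)
    return best


# Hypothetical three-device example: device 0 has good labels, 1 and 2 do not.
label_quality = [0.9, 0.2, 0.1]
divergence = [[0.0, 0.1, 0.5], [0.1, 0.0, 0.2], [0.5, 0.2, 0.0]]
energy = [[0.0, 0.2, 0.4], [0.2, 0.0, 0.1], [0.4, 0.1, 0.0]]

for phi in (0.5, 2.0):
    cost, roles, links = toy_st_lf(label_quality, divergence, energy, phi=phi)
    print(f"phi={phi}: roles={roles}, links={links}, cost={cost:.2f}")
```

Raising `phi` in this toy makes long, costly links less attractive, so a poorly labeled device may be kept as a source rather than linked to a distant one. This mirrors, in a very simplified way, the accuracy-energy trade-off described above.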