Heterogeneity across devices in federated learning (FL) typically refers to statistical (e.g., non-i.i.d. data distributions) and resource (e.g., communication bandwidth) dimensions. In this paper, we focus on another important dimension that has received less attention: varying quantities and distributions of labeled and unlabeled data across devices. To leverage all data, we develop a decentralized federated domain adaptation methodology that transfers ML models from devices with high-quality labeled data (called sources) to devices with low-quality or unlabeled data (called targets). Our methodology, Source-Target Determination and Link Formation (ST-LF), optimizes both (i) the classification of devices into sources and targets and (ii) source-target link formation, accounting for the trade-off between ML model accuracy and communication energy efficiency. To obtain a concrete objective function, we derive a measurable generalization error bound that accounts for estimates of source-target hypothesis deviations and divergences between data distributions. The resulting optimization problem is a mixed-integer signomial program, a class of NP-hard problems, which we solve tractably with an algorithm based on successive convex approximations. Numerical evaluations of ST-LF demonstrate that it improves classification accuracy and energy efficiency over state-of-the-art baselines.
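The abstract does not say which convex approximation ST-LF uses at each iteration. As a rough illustration only, the sketch below shows one step that is common in successive convex approximation for signomial programs: condensing a posynomial into a monomial with the arithmetic-geometric mean inequality. The function names (`posynomial`, `condense`, `monomial`) and the toy coefficients are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def posynomial(x, coeffs, exponents):
    """Evaluate g(x) = sum_k c_k * prod_i x_i ** a_ki; return (g, per-term values)."""
    x = np.asarray(x, dtype=float)
    terms = coeffs * np.prod(x ** exponents, axis=1)  # shape (K,)
    return terms.sum(), terms

def condense(x0, coeffs, exponents):
    """Monomial lower bound of g that is tight at x0 (AM-GM condensation).

    Assumes all coefficients and all entries of x0 are strictly positive.
    Returns (c_hat, a_hat) such that c_hat * prod_i x_i ** a_hat_i <= g(x).
    """
    g0, terms0 = posynomial(x0, coeffs, exponents)
    alpha = terms0 / g0                         # weights, sum to 1
    c_hat = np.prod((coeffs / alpha) ** alpha)  # condensed coefficient
    a_hat = alpha @ exponents                   # condensed exponents
    return c_hat, a_hat

def monomial(x, c_hat, a_hat):
    """Evaluate the condensed monomial at x."""
    return c_hat * np.prod(np.asarray(x, dtype=float) ** a_hat)

# Hypothetical toy posynomial g(x) = x1 + 2 * x2**2, expanded around x0 = (1, 1).
coeffs = np.array([1.0, 2.0])
exponents = np.array([[1.0, 0.0], [0.0, 2.0]])
x0 = np.array([1.0, 1.0])
c_hat, a_hat = condense(x0, coeffs, exponents)
```

Replacing a posynomial denominator with this monomial turns a ratio of posynomials into a posynomial, which a geometric-programming solver can handle. Because the monomial never exceeds the original posynomial, the approximated constraint is conservative. Repeating the expansion around each new iterate gives the successive scheme.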