Clustering clients into groups with relatively homogeneous data distributions is one of the main ways to improve the performance of federated learning (FL) when data are not independent and identically distributed (non-IID). Yet the applicability of current state-of-the-art approaches remains limited, because they cluster clients on information, such as the evolution of local model parameters, that can only be obtained through actual on-client training. At the same time, FL models need to be made available to clients that cannot perform training themselves, either because they lack the required processing capabilities or because they simply want to use the model without participating in training. Furthermore, existing alternatives that avoid training still require each client to hold a sufficient amount of labeled data on which to base the clustering, essentially assuming that every client is a data annotator. In this paper, we present REPA, an approach to client clustering in non-IID FL settings that requires neither training nor labeled data collection. REPA uses a novel supervised autoencoder-based method to create embeddings that profile a client's underlying data-generating processes without exposing the data to the server and without requiring local training. Our experimental analysis over three different datasets demonstrates that REPA delivers state-of-the-art model performance while expanding the applicability of cluster-based FL to previously uncovered use cases.
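To make the general idea of profile-based client clustering concrete, the following is a minimal sketch and not REPA itself. It assumes a hypothetical fixed linear projection (`PROJECTION`) as a stand-in for the supervised autoencoder, averages the embeddings of each client's unlabeled samples into one profile vector, and groups the profiles with plain k-means on the server. All names, dimensions, and the synthetic data are illustrative assumptions.

```python
import random

# Hypothetical stand-in encoder: a fixed 4 -> 2 linear projection.
# This is an assumption for illustration, NOT REPA's supervised autoencoder.
PROJECTION = [[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]]

def encode(sample):
    """Map one raw sample to a low-dimensional embedding."""
    return [sum(w * x for w, x in zip(row, sample)) for row in PROJECTION]

def client_profile(samples):
    """Average the embeddings of a client's unlabeled samples.

    Only this profile vector would leave the client; raw samples stay local.
    """
    embeddings = [encode(s) for s in samples]
    return [sum(col) / len(embeddings) for col in zip(*embeddings)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Lloyd's k-means with farthest-point initialisation; returns one label per point."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point farthest from all centroids chosen so far.
        centroids.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Synthetic non-IID setup (assumed): clients 0-2 draw data around 0, clients 3-5 around 5.
rng = random.Random(0)

def make_client(center, n=50, dim=4):
    return [[rng.gauss(center, 1.0) for _ in range(dim)] for _ in range(n)]

clients = [make_client(0.0) for _ in range(3)] + [make_client(5.0) for _ in range(3)]
profiles = [client_profile(c) for c in clients]   # computed on each client
labels = kmeans(profiles, k=2)                    # computed on the server
print(labels)
```

In this sketch no client trains a model or uses labels; the server sees only one short profile vector per client. How REPA actually builds and protects its embeddings is described in the paper body, not here.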