To Store or Not? Online Data Selection for Federated Learning with Limited Storage

Machine learning models have been deployed in mobile networks to process massive data from different layers, enabling automated network management and on-device intelligence. To overcome the high communication cost and severe privacy concerns of centralized machine learning, federated learning (FL) has been proposed to achieve distributed machine learning among networked devices. While computation and communication limitations have been widely studied, the impact of on-device storage on the performance of FL remains unexplored. Without an effective data selection policy to filter the massive streaming data on devices, classical FL can suffer from much longer model training time ($4\times$) and a significant reduction in inference accuracy ($7\%$), as observed in our experiments. In this work, we take the first step toward online data selection for FL with limited on-device storage. We first define a new data valuation metric for data evaluation and selection in FL, with theoretical guarantees for simultaneously speeding up model convergence and enhancing final model accuracy. We further design {\ttfamily ODE}, a framework of \textbf{O}nline \textbf{D}ata s\textbf{E}lection for FL, to coordinate networked devices to store valuable data samples. Experimental results on one industrial dataset and three public datasets show the remarkable advantages of {\ttfamily ODE} over state-of-the-art approaches. In particular, on the industrial dataset, {\ttfamily ODE} achieves up to a $2.5\times$ speedup in training time and a $6\%$ increase in inference accuracy, and is robust to various factors in practical environments.
