To Store or Not? Online Data Selection for Federated Learning with Limited Storage

Machine learning models have been deployed in mobile networks to process massive data from different layers, enabling automated network management and on-device intelligence. To overcome the high communication cost and severe privacy concerns of centralized machine learning, federated learning (FL) has been proposed to achieve distributed machine learning among networked devices. While computation and communication limitations have been widely studied, the impact of on-device storage on the performance of FL remains unexplored. Without an effective data selection policy to filter the massive streaming data on devices, classical FL can suffer from much longer model training time ($4\times$) and a significant reduction in inference accuracy ($7\%$), as observed in our experiments. In this work, we take the first step toward online data selection for FL with limited on-device storage. We first define a new data valuation metric for data evaluation and selection in FL, with theoretical guarantees for simultaneously speeding up model convergence and enhancing final model accuracy. We further design {\ttfamily ODE}, a framework of \textbf{O}nline \textbf{D}ata s\textbf{E}lection for FL, to coordinate networked devices to store valuable data samples. Experimental results on one industrial dataset and three public datasets show remarkable advantages of {\ttfamily ODE} over state-of-the-art approaches. In particular, on the industrial dataset, {\ttfamily ODE} achieves up to a $2.5\times$ speedup in training time and a $6\%$ increase in inference accuracy, and is robust to various factors in practical environments.
