Large machine learning training datasets can be distilled into small collections of informative synthetic data samples. These synthetic sets support efficient model learning and reduce the communication cost of data sharing. High-fidelity distilled data can thus support the efficient deployment of machine learning applications in distributed network environments. A naive way to construct a synthetic set in a distributed environment is to have each client perform local data distillation and to merge the local distillations at a central server. However, the quality of the resulting set is impaired by heterogeneity in the distributions of the local data held by clients. To overcome this challenge, we introduce the first collaborative data distillation technique, called CollabDM, which captures the global distribution of the data and requires only a single round of communication between clients and server. Our method outperforms the state-of-the-art one-shot learning method on skewed data in distributed learning environments. We also demonstrate the promising practical benefits of our method when applied to attack detection in 5G networks.
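To make the failure mode concrete, the following is a minimal, hypothetical sketch of the naive baseline described above: each client distills its local data independently, and the server simply concatenates the per-client uploads in a single round. The function names (`local_distill`, `naive_one_shot_merge`) and the class-mean placeholder used for distillation are illustrative assumptions, not the paper's actual algorithm or API.

```python
# Hypothetical sketch of the naive distributed baseline: local distillation
# per client, followed by a one-shot concatenation at the server.
from typing import List, Tuple
import numpy as np


def local_distill(x: np.ndarray, y: np.ndarray, n_synthetic: int) -> Tuple[np.ndarray, np.ndarray]:
    """Stand-in for any single-site dataset distillation routine.

    Placeholder only: repeats each class mean. A real client would instead
    optimize the synthetic samples (e.g., by matching feature distributions).
    """
    classes = np.unique(y)
    per_class = max(1, n_synthetic // len(classes))
    xs, ys = [], []
    for c in classes:
        xc = x[y == c]
        xs.append(np.repeat(xc.mean(axis=0, keepdims=True), per_class, axis=0))
        ys.append(np.full(per_class, c))
    return np.concatenate(xs), np.concatenate(ys)


def naive_one_shot_merge(
    clients: List[Tuple[np.ndarray, np.ndarray]], n_synthetic: int
) -> Tuple[np.ndarray, np.ndarray]:
    """Server-side merge: one upload per client, no further interaction.

    Under heterogeneous (non-IID) client data, each local synthetic set
    reflects only its own skewed distribution, which is the quality problem
    the abstract identifies and that CollabDM is designed to avoid.
    """
    parts = [local_distill(x, y, n_synthetic) for x, y in clients]
    xs, ys = zip(*parts)
    return np.concatenate(xs), np.concatenate(ys)
```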