The use of distributed optimization in machine learning can be motivated either by the preservation of privacy or by the increase in computational efficiency. On the one hand, training data might be stored across multiple devices. Training a global model within a network in which each node has access only to its own confidential data requires the use of distributed algorithms. Even if the data is not confidential, sharing it might be prohibitive due to bandwidth limitations. On the other hand, the ever-increasing amount of available data leads to large-scale machine learning problems. By splitting the training process across multiple nodes, its efficiency can be significantly increased. This paper aims to demonstrate how dual decomposition can be applied to the distributed training of $ K $-means clustering problems. After an overview of distributed and federated machine learning, the mixed-integer quadratically constrained programming-based formulation of the $ K $-means clustering training problem is presented. The training can be performed in a distributed manner by splitting the data across different nodes and linking these nodes through consensus constraints. Finally, the performance of the subgradient method, the bundle trust method, and the quasi-Newton dual ascent algorithm is evaluated on a set of benchmark problems. While the mixed-integer programming-based formulation of the clustering problems suffers from weak integer relaxations, the presented approach can potentially be used to enable an efficient solution in the future, in both a central and a distributed setting.
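To make the idea of linking nodes through consensus constraints concrete, the following is a minimal sketch and not the formulation used in the paper. It assumes the simplest possible case, $ K = 1 $, so that no integer assignment variables are needed and each node's subproblem has a closed-form solution. Each node $ i $ keeps its own copy $ c_i $ of the centroid, the copies are linked by the constraints $ c_i = c_{i+1} $, and the multipliers of these constraints are updated by dual ascent on the constraint residuals. In this smooth toy case the subgradient coincides with the gradient. The function name `dual_consensus_centroid`, the chain-shaped coupling of the nodes, and the step-size rule are illustrative assumptions.

```python
import numpy as np


def dual_consensus_centroid(blocks, iters=500):
    """Toy dual decomposition for one shared centroid (K = 1, hypothetical helper).

    blocks: list of (n_i, dim) arrays, one per node; raw data never leaves a node,
    only the local centroid copies c_i are exchanged with neighbouring nodes.
    """
    n = np.array([len(b) for b in blocks], dtype=float)   # points per node
    sums = np.array([b.sum(axis=0) for b in blocks])      # local coordinate sums
    num_nodes, dim = sums.shape
    # One multiplier vector per consensus constraint c_i - c_{i+1} = 0.
    lam = np.zeros((num_nodes - 1, dim))
    # Assumed step size: small enough for this chain-coupled quadratic dual.
    step = 0.5 * n.min()
    zero_row = np.zeros((1, dim))
    for _ in range(iters):
        # Net price seen by node i: mu_i = lam_i - lam_{i-1}, with lam_0 = lam_N = 0.
        mu = np.vstack([lam, zero_row]) - np.vstack([zero_row, lam])
        # Local subproblem: argmin_c sum_j ||x_j - c||^2 + mu_i^T c (closed form).
        c = (sums - 0.5 * mu) / n[:, None]
        # The consensus residual is the gradient of the dual function.
        residual = c[:-1] - c[1:]
        lam += step * residual
    return c, lam


# Usage on synthetic data split across three nodes.
rng = np.random.default_rng(0)
blocks = [
    rng.normal(loc=0.0, size=(50, 2)),
    rng.normal(loc=3.0, size=(30, 2)),
    rng.normal(loc=-1.0, size=(20, 2)),
]
centroids, multipliers = dual_consensus_centroid(blocks)
global_mean = np.vstack(blocks).mean(axis=0)
print(np.abs(centroids - global_mean).max())  # consensus error, should be near zero
```

In this simplified setting all local copies converge to the centroid that a central solver would compute from the pooled data. The $ K $-means case treated in the paper additionally involves binary assignment variables, so the node subproblems become mixed-integer programs and the dual function is nonsmooth, which is why subgradient-type and bundle methods are considered there.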