We develop a family of distributed center-based clustering algorithms that operate over networks of users. In the proposed scenario, each user holds a local dataset and communicates only with its immediate neighbours, with the aim of finding a clustering of the full, joint data. The proposed family, termed Distributed Gradient Clustering (DGC-$\mathcal{F}_\rho$), is parametrized by $\rho \geq 1$, which controls the proximity of users' center estimates, with $\mathcal{F}$ determining the clustering loss. Our framework admits a broad class of smooth convex loss functions. Specialized to popular clustering losses such as $K$-means and Huber loss, DGC-$\mathcal{F}_\rho$ yields the novel distributed clustering algorithms DGC-KM$_\rho$ and DGC-HL$_\rho$, while novel clustering losses based on the Logistic and Fair functions lead to DGC-LL$_\rho$ and DGC-FL$_\rho$. We provide a unified analysis and establish several strong results under mild assumptions. First, we show that the sequence of centers generated by the methods converges to a well-defined notion of fixed point, for any center initialization and any value of $\rho$. Second, we prove that, as $\rho$ increases, the family of fixed points produced by DGC-$\mathcal{F}_\rho$ converges to a notion of consensus fixed points. We further show that consensus fixed points of DGC-$\mathcal{F}_{\rho}$ are equivalent to fixed points of gradient clustering over the full data, guaranteeing that a clustering of the full data is produced. For the special case of Bregman losses, we show that our fixed points converge to the set of Lloyd points. Extensive numerical experiments on synthetic and real data confirm our theoretical findings, demonstrate strong performance of our methods, and showcase the usefulness and wide range of potential applications of our general framework, such as outlier detection.