Clustering analysis is of substantial importance in data mining. The properties of big data call for more efficient and economical distributed clustering methods. However, existing distributed clustering methods focus mainly on data size and overlook the problems caused by high data dimensionality. To address this, we propose a new distributed algorithm, Hashing-Based Distributed Clustering (HBDC). Motivated by the strong performance of hashing methods in nearest neighbor search, HBDC applies the learning-to-hash technique to the clustering problem, which offers substantial advantages in data storage, transmission, and computation. Following a global-sub-site paradigm, HBDC consists of two stages: distributed training of a hashing network at the sub-sites, and spectral clustering of the hash codes at the global site. The sub-sites use the learned network as a hash function to convert massive high-dimensional original data into a small number of hash codes, which they send to the global site for final clustering. In addition, a sample-selection method and lightweight network structures are designed to accelerate the convergence of the hash network. We also analyze the transmission cost of HBDC, including its upper bound. Experiments on synthetic and real datasets demonstrate the superiority of HBDC over existing state-of-the-art algorithms.
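The sub-site step and the transmission saving described in the abstract can be illustrated with a minimal sketch. It is not the HBDC implementation: the learned hash network is replaced here by fixed sign-random-projection hashing, the spectral clustering stage at the global site is omitted, and all function names (`make_hash_function`, `encode_sub_site`, `raw_cost_bits`, `code_cost_bits`, `max_distinct_codes`) are hypothetical. The cost model is also an assumption: it counts only the bits of the distinct codes and relies on the fact that a b-bit code can take at most 2^b distinct values; it is not the upper bound derived in the paper.

```python
import random


def make_hash_function(dim, n_bits, seed=0):
    """Return a sign-random-projection hash function.

    Assumption: this fixed random hash stands in for the learned hash
    network of HBDC, which is trained rather than sampled.
    """
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def hash_point(x):
        # One bit per hyperplane: 1 if the point lies on its non-negative side.
        return tuple(
            1 if sum(w * v for w, v in zip(plane, x)) >= 0 else 0
            for plane in planes
        )

    return hash_point


def encode_sub_site(points, hash_point):
    """Hash every point and keep only the distinct codes with their counts."""
    counts = {}
    for x in points:
        code = hash_point(x)
        counts[code] = counts.get(code, 0) + 1
    return counts


def raw_cost_bits(n_points, dim, bits_per_value=32):
    """Bits needed to send the raw data, assuming 32-bit values."""
    return n_points * dim * bits_per_value


def code_cost_bits(n_distinct_codes, n_bits):
    """Bits needed to send the distinct codes (per-code counts ignored)."""
    return n_distinct_codes * n_bits


def max_distinct_codes(n_points, n_bits):
    """A sub-site can never emit more than min(n, 2^b) distinct b-bit codes."""
    return min(n_points, 2 ** n_bits)


if __name__ == "__main__":
    rng = random.Random(1)
    dim, n_bits, n_points = 64, 8, 1000
    points = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_points)]

    codes = encode_sub_site(points, make_hash_function(dim, n_bits))
    print("distinct codes:", len(codes))
    print("raw cost (bits):", raw_cost_bits(n_points, dim))
    print("code cost (bits):", code_cost_bits(len(codes), n_bits))
    print("worst-case code cost (bits):",
          code_cost_bits(max_distinct_codes(n_points, n_bits), n_bits))
```

Under these assumptions, the code cost depends on the code length and the number of distinct codes rather than on the original dimension, which is the property the abstract attributes to hashing for high-dimensional data.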