Accelerating spherical K-means clustering for large-scale sparse document data

This paper presents an accelerated spherical K-means clustering algorithm for large-scale, high-dimensional sparse document data sets. We design the algorithm to work in an architecture-friendly manner (AFM), that is, to suppress performance-degradation factors in the CPUs of a modern computer system: the number of instructions, branch mispredictions, and cache misses. To operate in the AFM, we leverage universal characteristics (UCs) of the data-object set and the cluster-mean set, namely skewed distributions in data relationships, such as Zipf's law and a feature-value concentration phenomenon. The UCs indicate that most of the multiplications in similarity calculations involve terms with high document frequencies (df), and that most of the similarity between an object-feature vector and a mean-feature vector comes from multiplications involving a few high mean-feature values. Our proposed algorithm applies an inverted-index data structure to the mean set, uses two newly introduced structural parameters to extract the specific region of the mean-inverted index that holds high-df terms and high mean-feature values, and exploits the resulting three-part index for efficient pruning. The algorithm determines the two structural parameters by minimizing an approximate number of multiplications, which is related to the number of instructions; it reduces branch mispredictions by sharing the index structure, including the two parameters, among all objects; and it suppresses cache misses by keeping the frequently used data of the foregoing specific region in the caches. Together, these measures let it work in the AFM. We experimentally demonstrate that, on large-scale document data, our algorithm runs faster than algorithms that use state-of-the-art techniques.
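
To make the role of the mean-inverted index concrete, the following is a minimal Python sketch of a spherical K-means assignment step that computes similarities through an inverted index built over the cluster means. It is not the proposed algorithm: it omits the two structural parameters, the three-part index, and all pruning. The function names (`normalize`, `build_mean_inverted_index`, `assign`) and the dict-based sparse-vector representation are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (dict: term -> weight) to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm > 0 else dict(vec)


def build_mean_inverted_index(means):
    """Map each term to a list of (cluster_id, mean_value) postings."""
    index = defaultdict(list)
    for cid, mean in enumerate(means):
        for term, value in mean.items():
            index[term].append((cid, value))
    return index


def assign(doc, index, k):
    """Return (best_cluster, similarities) for one unit-norm sparse document.

    Only the terms present in the document are visited, and for each term
    only the clusters whose mean has a nonzero value for that term.
    """
    sims = [0.0] * k
    for term, weight in doc.items():
        for cid, value in index.get(term, ()):
            sims[cid] += weight * value  # one multiplication per posting
    best = max(range(k), key=sims.__getitem__)
    return best, sims


# Hypothetical toy data: two cluster means and one document.
means = [normalize({"a": 1.0, "b": 1.0}), normalize({"c": 1.0})]
index = build_mean_inverted_index(means)
doc = normalize({"a": 2.0, "b": 1.0})
best, sims = assign(doc, index, k=len(means))
```

In this sketch the number of multiplications for one document equals the total length of the postings lists of its terms. A term that appears in many documents tends to appear in many cluster means as well, so its postings list is long, which illustrates why most of the multiplications concern high-df terms.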
