Community detection has become an important problem with the rapid growth of social networks. Mean-Shift is an effective clustering algorithm, but it cannot be applied directly to community detection: it can only handle data with coordinates, whereas the data in community detection is mostly represented as a graph, which can be treated as data with a distance matrix (or similarity matrix). Fortunately, the Medoid-Shift clustering algorithm preserves the benefits of Mean-Shift and can be applied to problems based on a distance matrix, such as community detection. One drawback of Medoid-Shift is that the neighborhood region defined by a distance parameter may contain no data points. To handle the community detection problem better, this work proposes a new algorithm called Revised Medoid-Shift (RMS). When finding the next medoid, RMS uses a neighborhood defined by k-nearest neighbors (KNN), whereas the original Medoid-Shift uses a neighborhood defined by a distance parameter. Because the KNN neighborhood is more stable than the distance-based one in terms of the number of data points it contains, RMS may converge more smoothly. In RMS, each data point is shifted towards a medoid within its KNN neighborhood. After the iterative shifting process, each data point converges to a cluster center, and the data points that converge to the same center are grouped into the same cluster.
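The shifting procedure described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the medoid inside a KNN neighborhood is chosen, so the sketch assumes it is the neighborhood member with the smallest total distance to the other members. The names `knn_medoid_shift`, `dist`, `k`, and `max_iter` are hypothetical, and the iteration cap is an added safeguard in case the shifts cycle instead of reaching a fixed point.

```python
import numpy as np


def knn_medoid_shift(dist, k, max_iter=100):
    """Illustrative KNN-based medoid shift on a precomputed distance matrix.

    Assumption (not stated in the abstract): the medoid of a neighborhood is
    the member with the smallest summed distance to the other members.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]

    # KNN neighborhood of each point; the point itself is included because
    # its self-distance is 0. A stable sort keeps tie-breaking deterministic.
    neighbors = np.argsort(dist, axis=1, kind="stable")[:, :k]

    # One shift step: every point moves to the medoid of its neighborhood.
    next_medoid = np.empty(n, dtype=int)
    for i in range(n):
        nb = neighbors[i]
        cost = dist[np.ix_(nb, nb)].sum(axis=1)
        next_medoid[i] = nb[np.argmin(cost)]

    # Repeat the shift until no point moves (or the cap is reached).
    current = np.arange(n)
    for _ in range(max_iter):
        shifted = next_medoid[current]
        if np.array_equal(shifted, current):
            break
        current = shifted

    # Points that end at the same center share a cluster label.
    _, labels = np.unique(current, return_inverse=True)
    return current, labels


# Toy example: two well-separated groups on a line, given only as distances.
x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
toy_dist = np.abs(x[:, None] - x[None, :])
centers, labels = knn_medoid_shift(toy_dist, k=3)
print(centers.tolist())  # [1, 1, 1, 4, 4, 4]
print(labels.tolist())   # [0, 0, 0, 1, 1, 1]
```

Because every neighborhood here contains exactly `k` points, no point is ever left with an empty neighborhood, which is the failure mode of the distance-parameter variant. For a graph, `dist` would first have to be derived from the network, for example from shortest-path lengths; that step is outside this sketch.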