The CLASSIX algorithm is a fast and explainable approach to data clustering. In its original form, this algorithm exploits the sorting of the data points by their first principal component to truncate the search for nearby data points, with nearness being defined in terms of the Euclidean distance. Here we extend CLASSIX to other distance metrics, including the Manhattan distance and the Tanimoto distance. Instead of principal components, we use an appropriate norm of the data vectors as the sorting criterion, combined with the triangle inequality for search termination. For the Tanimoto distance, a provably sharper intersection inequality is used to further boost the performance of the new algorithm. On a real-world chemical fingerprint benchmark, CLASSIX Tanimoto is about 30 times faster than the Taylor–Butina algorithm, and about 80 times faster than DBSCAN, while computing higher-quality clusters in both cases.
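The sort-and-terminate idea can be illustrated with a minimal Python sketch. It covers only the radius neighbor search, not the full CLASSIX clustering, and it is not the authors' implementation. All function names are hypothetical. For the Manhattan distance the cut-off follows from the reverse triangle inequality, $\big|\,\|x\|_1 - \|y\|_1\big| \le \|x - y\|_1$. For the Tanimoto distance on binary fingerprints, the sketch assumes the standard bound $|A \cap B| \le \min(|A|, |B|)$, which may differ in detail from the intersection inequality used in the paper. It also assumes that every fingerprint has at least one set bit and that the radius is below 1.

```python
import numpy as np

def pairs_within_radius(X, key, upper_key, dist, radius):
    """Find all index pairs (a, b), a < b, with dist <= radius.

    Points are sorted by a scalar key; for each point, candidates whose key
    exceeds upper_key(key_i, radius) cannot be within the radius and are
    never examined.
    """
    keys = key(X)
    order = np.argsort(keys, kind="stable")
    Xs, ks = X[order], keys[order]
    pairs = set()
    for i in range(len(Xs)):
        # Binary search for the cut-off; equivalent to stopping a linear
        # scan over the sorted data as soon as the key bound is violated.
        stop = np.searchsorted(ks, upper_key(ks[i], radius), side="right")
        cand = np.arange(i + 1, stop)
        d = dist(Xs[cand], Xs[i])
        for j in cand[d <= radius]:
            a, b = int(order[i]), int(order[j])
            pairs.add((min(a, b), max(a, b)))
    return pairs

def manhattan_pairs(X, radius):
    # Sort by the 1-norm; | ||x||_1 - ||y||_1 | <= ||x - y||_1 gives the cut-off.
    return pairs_within_radius(
        X,
        key=lambda A: np.abs(A).sum(axis=1),
        upper_key=lambda k, r: k + r,
        dist=lambda C, x: np.abs(C - x).sum(axis=1),
        radius=radius,
    )

def tanimoto_distance(C, x):
    # C: boolean matrix of candidates, x: one boolean fingerprint.
    inter = (C & x).sum(axis=1)
    union = (C | x).sum(axis=1)
    return 1.0 - inter / union

def tanimoto_pairs(B, radius):
    # Sort by bit count. If |A| <= |B|, the similarity is at most |A| / |B|,
    # so only candidates with |B| <= |A| / (1 - radius) can qualify.
    return pairs_within_radius(
        B.astype(bool),
        key=lambda A: A.sum(axis=1),
        upper_key=lambda k, r: k / (1.0 - r),
        dist=tanimoto_distance,
        radius=radius,
    )
```

Both wrappers return the same pairs as an exhaustive all-pairs comparison; the sorting only reduces how many distances are evaluated. How much is saved depends on how widely the norms (or bit counts) are spread relative to the radius.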