Exact hierarchical agglomerative clustering (HAC) of large spatial datasets is limited in practice by the $\mathcal{O}(n^2)$ time and memory required to build and store the full pairwise distance matrix. We present GSHAC (Geographically Sparse Hierarchical Agglomerative Clustering), a system that makes exact HAC feasible for millions of geographic features on a commodity workstation. GSHAC replaces the distance matrix with a sparse geographic distance graph that contains only the pairs within a user-specified geodesic bound~$h_{\max}$. The graph is constructed in $\mathcal{O}(n \cdot k)$ time via spatial indexing, where~$k$ is the mean number of neighbors within~$h_{\max}$. Connected components of this graph define independent subproblems, and we prove that the resulting cluster assignments are exact for all standard linkage methods at any cut height $h \le h_{\max}$. For single linkage, a minimum-spanning-tree (MST) code path keeps memory at $\mathcal{O}(n_k + m_k)$ per component, where $n_k$ and $m_k$ are the numbers of vertices and edges in the component. Applied to a global mining inventory ($n = 261{,}073$), the system completes in 12\,s with 109\,MiB peak HAC memory, compared with $\approx 545$\,GiB of memory for the dense baseline. On a 2-million-point GeoNames sample, all tested thresholds completed in under 3\,minutes with peak memory under 3\,GiB. We provide a scikit-learn-compatible implementation for direct integration into GIS workflows.
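
The single-linkage case of the exactness claim can be illustrated with a minimal sketch. This is not the GSHAC implementation: the function names (`haversine_km`, `sparse_edges`, `single_linkage_labels`) are hypothetical, distances are assumed to be in kilometres, and a brute-force pair scan stands in for the spatial index, so the sketch does not reproduce the $\mathcal{O}(n \cdot k)$ construction cost. It only shows that cutting at $h \le h_{\max}$ gives the same labels whether the edge list is sparse or dense.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sparse_edges(points, h_max):
    """Keep only the pairs within h_max.

    Brute force for illustration only; a spatial index would replace this loop.
    """
    edges = []
    for i, j in combinations(range(len(points)), 2):
        d = haversine_km(points[i], points[j])
        if d <= h_max:
            edges.append((d, i, j))
    return edges


def single_linkage_labels(n, edges, h):
    """Flat single-linkage clusters at cut height h via union-find (Kruskal order)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for d, i, j in sorted(edges):
        if d > h:
            break  # remaining edges are longer than the cut height
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Three points about 1.1 km apart on the equator, plus one distant point.
points = [(0.0, 0.0), (0.0, 0.01), (0.0, 0.02), (10.0, 10.0)]
sparse = sparse_edges(points, h_max=5.0)
dense = sparse_edges(points, h_max=float("inf"))
assert single_linkage_labels(4, sparse, h=2.0) == single_linkage_labels(4, dense, h=2.0)
print(single_linkage_labels(4, sparse, h=2.0))  # [0, 0, 0, 1]
```

In this toy example the sparse graph holds 3 of the 6 possible pairs. Edges longer than the cut height are never merged by single linkage, which is why discarding pairs beyond $h_{\max}$ leaves the result unchanged for any $h \le h_{\max}$.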