A wide range of (multivariate) temporal (1D) and spatial (2D) data analysis tasks, such as grouping vehicle sensor trajectories, can be formulated as clustering with given metric constraints. Existing metric-constrained clustering algorithms overlook the rich correlation between feature similarity and metric distance, i.e., metric autocorrelation. The model-based variants of these algorithms (e.g., TICC and STICC) achieve state-of-the-art performance, yet suffer from computational instability and high complexity because they rely on a metric-constrained Expectation-Maximization procedure. To address these two problems, we propose a novel clustering algorithm, MC-GTA (Model-based Clustering via Goodness-of-fit Tests with Autocorrelations). Its objective consists solely of pairwise weighted sums of feature similarity terms (squared Wasserstein-2 distance) and metric autocorrelation terms (a novel multivariate generalization of the classic semivariogram). We show that MC-GTA effectively minimizes the total hinge loss over intra-cluster observation pairs that fail goodness-of-fit tests, i.e., pairs that statistically do not originate from the same distribution. Experiments on 1D/2D synthetic and real-world datasets demonstrate that MC-GTA successfully incorporates metric autocorrelation. It outperforms strong baselines by large margins (up to 14.3% in ARI and 32.1% in NMI) with faster and more stable optimization (>10x speedup).
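To make the feature similarity term concrete, the following minimal sketch computes the squared Wasserstein-2 distance between two multivariate Gaussians using the standard closed form. It assumes, as an illustration only, that each observation is summarized by a Gaussian model (mean and covariance); the abstract does not state this, and the function names `sqrtm_psd` and `squared_w2_gaussian` are hypothetical rather than part of MC-GTA.

```python
import numpy as np


def sqrtm_psd(a):
    """Symmetric square root of a positive semi-definite matrix."""
    a = (a + a.T) / 2.0                      # remove numerical asymmetry
    eigvals, eigvecs = np.linalg.eigh(a)
    eigvals = np.clip(eigvals, 0.0, None)    # clamp tiny negative eigenvalues
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T


def squared_w2_gaussian(mean1, cov1, mean2, cov2):
    """Squared Wasserstein-2 distance between N(mean1, cov1) and N(mean2, cov2).

    Closed form: ||m1 - m2||^2 + tr(C1 + C2 - 2 (C2^{1/2} C1 C2^{1/2})^{1/2}).
    Assumption: both inputs are Gaussian; this is not taken from the source.
    """
    mean1, mean2 = np.asarray(mean1, float), np.asarray(mean2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    root2 = sqrtm_psd(cov2)
    cross = sqrtm_psd(root2 @ cov1 @ root2)
    mean_term = np.sum((mean1 - mean2) ** 2)
    cov_term = np.trace(cov1 + cov2 - 2.0 * cross)
    return float(mean_term + cov_term)


if __name__ == "__main__":
    identity = np.eye(2)
    # Equal covariances: the distance reduces to the squared mean difference.
    print(squared_w2_gaussian([0.0, 0.0], identity, [3.0, 4.0], identity))
    # Equal means: only the covariance term contributes.
    print(squared_w2_gaussian([0.0, 0.0], identity, [0.0, 0.0], 4.0 * identity))
```

In a pairwise objective of the kind described above, such a distance would be evaluated for each pair of observations and compared against a threshold derived from the metric autocorrelation term; how that threshold is constructed is specific to the paper and is not reproduced here.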