We present algorithms for computing $\varepsilon$-coresets for $k$-median clustering of point sequences in $\mathbb{R}^d$ under the $p$-dynamic time warping (DTW) distance. Coresets under DTW have not been investigated before, and existing methods do not apply directly to the analysis because DTW is not a metric. Three main ingredients enable our coreset construction: an adaptation of the sensitivity-sampling framework for $\varepsilon$-coresets, bounds on the VC dimension of approximations to the range spaces of balls under DTW, and new approximation algorithms for the $k$-median problem under DTW. We achieve our results by investigating approximations of DTW that trade off accuracy against amenability to known techniques. In particular, we observe that given $n$ curves under DTW, one can directly construct a metric that approximates DTW on this set, which permits the use of the wealth of results on metric spaces for clustering purposes. The resulting approximations are the first with polynomial running time and achieve an approximation factor very similar to that of state-of-the-art techniques. We apply our results to produce a practical algorithm approximating $(k,\ell)$-median clustering under DTW.
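To make the claim that DTW is not a metric concrete, the following minimal Python sketch computes a $p$-DTW distance by the standard dynamic program and exhibits three short sequences that violate the triangle inequality. It assumes the common definition in which the $p$-th powers of Euclidean distances along a warping path are summed and the $p$-th root is taken at the end; the names `dtw_p` and `metric_closure` are illustrative and do not come from the paper. The shortest-path closure shown is one generic way to obtain a distance that satisfies the triangle inequality on a finite set of curves; whether it coincides with the construction used in the paper is an assumption.

```python
import math


def dtw_p(s, t, p=1.0):
    """p-DTW between two point sequences s and t (lists of coordinate tuples).

    Assumed definition: minimum over all warping paths of
    (sum of ||s_i - t_j||^p over matched pairs)^(1/p).
    """
    n, m = len(s), len(t)
    inf = float("inf")
    # table[i][j] = cheapest cost of warping s[:i] onto t[:j]
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(s[i - 1], t[j - 1]) ** p
            table[i][j] = cost + min(
                table[i - 1][j],      # advance in s only
                table[i][j - 1],      # advance in t only
                table[i - 1][j - 1],  # advance in both
            )
    return table[n][m] ** (1.0 / p)


def metric_closure(curves, p=1.0):
    """Shortest-path closure of pairwise p-DTW values (Floyd-Warshall).

    Illustrative only: the result satisfies the triangle inequality on the
    given finite set and never exceeds the original DTW values.
    """
    k = len(curves)
    d = [[dtw_p(curves[i], curves[j], p) for j in range(k)] for i in range(k)]
    for via in range(k):
        for i in range(k):
            for j in range(k):
                if d[i][via] + d[via][j] < d[i][j]:
                    d[i][j] = d[i][via] + d[via][j]
    return d


# Three one-dimensional sequences, written as lists of 1-tuples.
a = [(0.0,)]
b = [(0.0,), (1.0,)]
c = [(1.0,), (1.0,), (1.0,)]

d_ab = dtw_p(a, b)  # 1.0
d_bc = dtw_p(b, c)  # 1.0
d_ac = dtw_p(a, c)  # 3.0, which exceeds d_ab + d_bc
closure = metric_closure([a, b, c])
```

Here the single point of `a` must be matched to all three points of `c`, which costs 3, while the detour through `b` costs only 1 + 1 = 2. The closure replaces the direct value by the cost of that detour, so `closure[0][2]` equals 2.0.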