We present algorithms for computing $\varepsilon$-coresets for $k$-median clustering of point sequences in $\mathbb{R}^d$ under the $p$-dynamic time warping (DTW) distance. Coresets under DTW have not been investigated before, and existing methods do not apply directly to their analysis because DTW is not a metric. Three main ingredients enable our coreset construction: an adaptation of the sensitivity-sampling framework for $\varepsilon$-coresets, bounds on the VC dimension of approximations to the range spaces of balls under DTW, and new approximation algorithms for the $k$-median problem under DTW. We achieve our results by investigating approximations of DTW that trade off accuracy against amenability to known techniques. In particular, we observe that given $n$ curves under DTW, one can directly construct a metric that approximates DTW on this set, which makes the wealth of results on clustering in metric spaces available. The resulting approximation algorithms are the first with polynomial running time and achieve an approximation factor very similar to that of state-of-the-art techniques. We apply our results to produce a practical algorithm approximating $(k,\ell)$-median clustering under DTW.
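To make the observation about DTW and metrics concrete, the following is a minimal sketch and not the construction used in this work. It assumes that $p$-DTW is the $p$-th root of the minimum, over all warping paths, of the sum of $p$-th powers of Euclidean distances between matched points. It also assumes that a shortest-path closure over the pairwise DTW values is one natural way to obtain a distance on $n$ curves that satisfies the triangle inequality. The function names `dtw_p` and `metric_closure` are hypothetical and chosen for illustration only.

```python
import math

def dtw_p(s, t, p=1.0):
    """p-DTW between two point sequences, given as lists of equal-length tuples.

    Assumed definition: p-th root of the minimal sum of p-th powers of
    Euclidean distances over all warping paths.
    """
    m, n = len(s), len(t)
    inf = float("inf")
    # cost[i][j]: minimal sum of p-th powers aligning s[:i] with t[:j]
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = math.dist(s[i - 1], t[j - 1]) ** p
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[m][n] ** (1.0 / p)

def metric_closure(curves, p=1.0):
    """Shortest-path closure (Floyd-Warshall) of the pairwise p-DTW values.

    The result never exceeds DTW and satisfies the triangle inequality.
    This is an illustrative assumption, not necessarily the paper's metric.
    """
    n = len(curves)
    d = [[dtw_p(curves[i], curves[j], p) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

# Three 1-dimensional sequences on which DTW violates the triangle inequality.
a = [(0.0,)]
b = [(1.0,)]
c = [(1.0,), (1.5,), (1.0,)]

direct = dtw_p(a, c)                # a's single point is matched to all of c
detour = dtw_p(a, b) + dtw_p(b, c)  # going through b is cheaper
closure = metric_closure([a, b, c])
```

In this example the direct DTW distance from `a` to `c` is larger than the detour through `b`, so DTW is not a metric on these three sequences, while the closure replaces the direct value with the cheaper detour. How closely such a closure approximates DTW in general is not something this sketch establishes.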