Measuring distance or similarity between time-series is fundamental to many applications, including classification, clustering, and ensembling/alignment. Existing measures may fail to capture similarities among local trends (shapes) and may even produce misleading results. Our goal is to develop a measure that looks for similar trends occurring around similar times and is easily interpretable for researchers in applied domains. This is particularly useful for applications where time-series consist of an ordered sequence of meaningful local trends, such as in epidemics (a surge, then an increase, a peak, and a decrease). We propose a novel measure, DTW+S, which creates an interpretable "closeness-preserving" matrix representation of the time-series, in which each column represents local trends, and then applies Dynamic Time Warping to compute distances between these matrices. We present a theoretical analysis that supports the choice of this representation. We demonstrate the utility of DTW+S in several tasks. For the clustering of epidemic curves, we show that DTW+S is the only measure among those compared that produces a good clustering. For ensemble building, we propose a combination of DTW+S and barycenter averaging that best preserves the characteristics of the underlying trajectories. We also demonstrate that our approach yields better classification than Dynamic Time Warping for a class of datasets, particularly when local trends rather than scale play a decisive role.
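The two-stage idea described in the abstract (a matrix of local trends, then Dynamic Time Warping over its columns) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the four reference shapes in `SHAPES`, the window length of 4, the use of Pearson correlation as the similarity score, the zero score for flat windows, the Euclidean distance between columns, and the optional `band` constraint that limits warping to nearby times. The names `shape_matrix`, `dtw_distance`, and `dtw_plus_s_sketch` are hypothetical.

```python
import numpy as np

# Hypothetical reference shapes (one per row), each of window length 4.
SHAPES = np.array([
    [1.0, 2.0, 3.0, 4.0],   # steady increase
    [4.0, 3.0, 2.0, 1.0],   # steady decrease
    [1.0, 2.0, 2.0, 1.0],   # peak
    [1.0, 2.0, 4.0, 8.0],   # surge
])

def shape_matrix(series, shapes=SHAPES, eps=1e-12):
    """Return a (num_shapes, num_windows) matrix of local-trend scores.

    Column t holds the Pearson correlation between the window starting at
    position t and each reference shape (an assumed similarity score).
    Flat windows have no defined correlation and are scored 0.
    """
    series = np.asarray(series, dtype=float)
    width = shapes.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(series, width)
    wc = windows - windows.mean(axis=1, keepdims=True)   # centered windows
    sc = shapes - shapes.mean(axis=1, keepdims=True)     # centered shapes
    num = sc @ wc.T
    denom = np.outer(np.linalg.norm(sc, axis=1), np.linalg.norm(wc, axis=1))
    return np.divide(num, denom, out=np.zeros_like(num), where=denom > eps)

def dtw_distance(A, B, band=None):
    """Classic DTW between two matrices, aligning columns with columns.

    The local cost is the Euclidean distance between columns. If `band` is
    given, column i of A may only be matched to column j of B when
    |i - j| <= band, so trends are matched only around similar times.
    """
    n, m = A.shape[1], B.shape[1]
    cost = np.linalg.norm(A[:, :, None] - B[:, None, :], axis=0)  # (n, m)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        lo = 1 if band is None else max(1, i - band)
        hi = m if band is None else min(m, i + band)
        for j in range(lo, hi + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[n, m]

def dtw_plus_s_sketch(x, y, band=None):
    """Distance between two series via their local-trend matrices."""
    return dtw_distance(shape_matrix(x), shape_matrix(y), band=band)
```

Because the assumed score is a correlation, two curves that differ only by a positive scale factor map to the same matrix and therefore have distance close to zero under this sketch, which matches the abstract's emphasis on local trends rather than scale. A small `band` keeps matches near the same time index, while `band=None` allows unrestricted warping.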